A practical question about the ik analyzer

Source: 1-1 Course Introduction

cloverxixi

2022-12-12

Hello, I have an index whose mapping contains a name field, configured as follows:

  "name": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
  },

One of the documents has the name value: 上海七剑投资管理有限公司
Analyzing it with ik_max_word produces the following tokens:

{
  "tokens" : [
    {
      "token" : "上海",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "七",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "TYPE_CNUM",
      "position" : 1
    },
    {
      "token" : "剑",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "COUNT",
      "position" : 2
    },
    {
      "token" : "投资",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "管理",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "有限公司",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "有限",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "公司",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}

Analyzing the same text with ik_smart produces the following tokens:

{
  "tokens" : [
    {
      "token" : "上海",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "七剑",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "TYPE_CQUAN",
      "position" : 1
    },
    {
      "token" : "投资",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "管理",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "有限公司",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
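
The two outputs above have the shape of responses from Elasticsearch's `_analyze` API. As a minimal sketch for reproducing them, assuming the ik plugin is installed on the cluster, the request bodies can be built like this (the helper name `analyze_body` is hypothetical, introduced only for illustration):

```python
import json


def analyze_body(analyzer: str, text: str) -> str:
    """Build the JSON body for an Elasticsearch `_analyze` request."""
    # ensure_ascii=False keeps the Chinese text readable in the output.
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)


# Each printed body can be sent as `GET /_analyze`, for example from Kibana Dev Tools.
for analyzer in ("ik_max_word", "ik_smart"):
    print(analyze_body(analyzer, "上海七剑投资管理有限公司"))
```

Sending both bodies and comparing the responses side by side makes the difference between the index-time and search-time analyzers easy to see.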

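The two characters “七剑” are tokenized differently by the two analyzers.
When I search for the keyword “七剑”, the query is analyzed with ik_smart, which keeps “七剑” as a single token: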

{
  "tokens" : [
    {
      "token" : "七剑",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "TYPE_CQUAN",
      "position" : 0
    }
  ]
}

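At index time, however, the text was indexed as the separate tokens “七” and “剑”, so the query does not match the document.
From a business point of view, this is not what users expect.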
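
To make the mismatch concrete, here is a minimal sketch using the token lists copied from the outputs above. It assumes a simplified model of a match query, where a document is a hit if any search token equals an indexed token exactly; the function name `matches` is hypothetical, and real scoring and positions are ignored.

```python
# Tokens copied from the analyzer outputs in this post.
INDEX_TOKENS = ["上海", "七", "剑", "投资", "管理", "有限公司", "有限", "公司"]  # ik_max_word
SEARCH_TOKENS = ["七剑"]  # ik_smart applied to the query "七剑"


def matches(index_tokens, search_tokens):
    """Simplified match: hit if any search token equals an indexed token."""
    indexed = set(index_tokens)
    return any(token in indexed for token in search_tokens)


print(matches(INDEX_TOKENS, SEARCH_TOKENS))  # False: "七剑" was never indexed as one token
print(matches(INDEX_TOKENS, ["七", "剑"]))    # True: the single characters were indexed
```

The search only hits when the search-time tokens are also present among the index-time tokens, which is not the case for “七剑” here.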
How should this situation be handled? I would be very grateful for an answer.

1 answer