02 - Core Concepts and the IK Analyzer - ElasticSearch

Core concepts
IK analyzer
- Installation
- Testing with Kibana

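Core concepts

Document: corresponds to a record (row) in a relational database.

Type: rarely used; it can be set or left unset (mapping types were deprecated in ES 7.x and removed in 8.0).

Index: the counterpart of a database.

Shards and replicas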


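A cluster has at least one node, and a node is simply one ES process. A node can hold multiple indices. By default, a newly created index consists of 5 primary shards, and each primary shard has one replica shard. (Five primaries was the default before ES 7.0; from 7.0 on the default is 1 primary shard.)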
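To make the shard arithmetic concrete, here is a minimal sketch of the settings body an index-creation request could carry. `number_of_shards` and `number_of_replicas` are the real index settings; the helper names `index_settings` and `total_shard_copies` are hypothetical and exist only for this illustration.

```python
def index_settings(primary_shards=5, replicas_per_primary=1):
    """Build a request body that could be sent with PUT /<index-name>."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas_per_primary,
            }
        }
    }


def total_shard_copies(body):
    """Count primaries plus all of their replica copies."""
    index = body["settings"]["index"]
    return index["number_of_shards"] * (1 + index["number_of_replicas"])


body = index_settings()
print(body)
print(total_shard_copies(body))  # 5 primaries + 5 replicas = 10
```

With the 5-primary, 1-replica layout described above, the cluster has to place 10 shard copies in total.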

Inverted index

Understand how it works and what its advantages are.
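As a rough illustration of the idea, the toy sketch below maps each term to the set of documents containing it, so a query only looks up terms instead of scanning every document. It is not how Lucene stores its index; the sample documents and function names are made up for this example, and terms are split on whitespace only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "elasticsearch is built on lucene",
    2: "lucene uses an inverted index",
    3: "kibana talks to elasticsearch",
}
index = build_inverted_index(docs)
print(sorted(search(index, "lucene")))                # [1, 2]
print(sorted(search(index, "elasticsearch lucene")))  # [1]
```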

ES indices and Lucene indices
In ES, an index is split into multiple shards, and each shard is a Lucene index. So one ES index is made up of multiple Lucene indices.

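IK analyzer

Tokenization means splitting a piece of Chinese (or other) text into individual keywords. When we search, our query is tokenized, the data in the database or index is tokenized as well, and the two are then matched. The default analyzer treats every Chinese character as a separate word: for example, "我爱狂神" is split into "我", "爱", "狂", "神". That is clearly not what we want, so we install the Chinese analyzer ik to solve the problem.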
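The difference can be sketched with a toy example. The code below contrasts per-character splitting with a simple dictionary-based forward-maximum-match segmenter. This is only an illustration of why a dictionary helps; it is not IK's actual algorithm, and the one-word dictionary is an assumption made for the example.

```python
def per_character(text):
    """Treat every non-space character as its own token."""
    return [ch for ch in text if not ch.isspace()]


def forward_max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word,
    falling back to a single character when nothing matches."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


print(per_character("我爱狂神"))                  # ['我', '爱', '狂', '神']
print(forward_max_match("我爱狂神", {"狂神"}))    # ['我', '爱', '狂神']
```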

If you work with Chinese text, the ik analyzer is recommended!

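Installation

https://github.com/medcl/elasticsearch-analysis-ik

Download the release and unzip it into its own folder under the plugins directory of the ES installation (for example plugins/ik).

The plugin version must match the ES version!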

Restart ES and watch the log: sudo tail -f /var/log/elasticsearch/elasticsearch.log

Or:
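As one possible alternative to tailing the log, and not necessarily the one the author had in mind, the `_cat/plugins` API lists the plugins each node has loaded. The sketch below checks such a response for the IK plugin and for a version that matches ES. The plugin name `analysis-ik` is what the plugin is assumed to report, and the sample rows and the version `7.6.1` are hypothetical data, not real output.

```python
IK_PLUGIN_NAME = "analysis-ik"  # assumed component name reported by the IK plugin


def ik_plugin_ok(cat_plugins_rows, es_version):
    """Return True if some node reports the IK plugin at the same version as ES."""
    return any(
        row.get("component") == IK_PLUGIN_NAME and row.get("version") == es_version
        for row in cat_plugins_rows
    )


# Hypothetical sample of what GET /_cat/plugins?format=json might return.
sample_rows = [{"name": "node-1", "component": "analysis-ik", "version": "7.6.1"}]

print(ik_plugin_ok(sample_rows, "7.6.1"))  # True
print(ik_plugin_ok(sample_rows, "7.6.2"))  # False
```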

Testing with Kibana

Fewest splits (ik_smart): the resulting tokens do not overlap

{
  "tokens" : [
    {
      "token" : "中国人民",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "站起",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "来了",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
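A response like the one above comes from the `_analyze` API. Here is a minimal sketch of building such a request body. The input text is reconstructed from the token offsets, the analyzer name `ik_smart` is taken from the plugin's documented analyzers, and the helper `analyze_request` is hypothetical.

```python
import json


def analyze_request(analyzer, text):
    """Build the JSON body for a GET or POST /_analyze request."""
    return {"analyzer": analyzer, "text": text}


body = analyze_request("ik_smart", "中国人民站起来了")
print("GET _analyze")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Swapping the analyzer name for `ik_max_word` gives the request for the finest-grained mode.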

Finest-grained split (ik_max_word)

{
  "tokens" : [
    {
      "token" : "中国人民",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中国人",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "国人",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "人民",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "站起来",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "站起",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "起来",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "来了",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 8
    }
  ]
}
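The difference between the two outputs can be checked mechanically from the offsets. In this small sketch the spans are copied from the `start_offset` and `end_offset` fields of the two responses above, and the helper `has_overlap` is hypothetical.

```python
def has_overlap(spans):
    """Return True if any two (start_offset, end_offset) spans overlap."""
    ordered = sorted(spans)
    return any(
        next_start < prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )


# Offsets from the fewest-splits output.
smart_spans = [(0, 4), (4, 6), (6, 8)]
# Offsets from the finest-grained output.
max_word_spans = [(0, 4), (0, 3), (0, 2), (1, 3), (2, 4), (4, 7), (4, 6), (5, 7), (6, 8)]

print(has_overlap(smart_spans))     # False
print(has_overlap(max_word_spans))  # True
```

The fewest-splits mode covers the text once with non-overlapping tokens, while the finest-grained mode emits every dictionary word it can find, so the spans overlap.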

Now let's split a harder phrase:

{
  "tokens" : [
    {
      "token" : "老",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "八",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "TYPE_CNUM",
      "position" : 1
    },
    {
      "token" : "蜜",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "制",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "小",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "汉堡",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}

ES does not recognize "老八" and "小汉堡". What can we do? Add them to a custom dictionary!
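A minimal sketch of what that step could look like follows. IK reads extension dictionaries as UTF-8 text files with one word per line, referenced from the `ext_dict` entry of its `IKAnalyzer.cfg.xml`. The file name `custom.dic` and both helper functions are hypothetical, and the sketch writes to a temporary directory rather than to a real plugin config folder.

```python
import tempfile
from pathlib import Path


def write_custom_dict(config_dir, filename, words):
    """Write a dictionary file: UTF-8, one word per line."""
    path = Path(config_dir) / filename
    path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return path


def ext_dict_entry(filename):
    """Return the line to place inside <properties> in IKAnalyzer.cfg.xml."""
    return f'<entry key="ext_dict">{filename}</entry>'


with tempfile.TemporaryDirectory() as tmp:
    dic_path = write_custom_dict(tmp, "custom.dic", ["老八", "小汉堡"])
    print(dic_path.read_text(encoding="utf-8"))
    print(ext_dict_entry(dic_path.name))
```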

Restart ES and try again. It works!

{
  "tokens" : [
    {
      "token" : "老八",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "蜜",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "制",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "小汉堡",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

In similar situations from now on, we just add the words ourselves to a custom dic file and configure it!