Full-text search: Elasticsearch tokenization - Gulimall (谷粒商城) Microservices Study Notes

1. Tokenization concepts
- Character filters
- Tokenizer
- Token filters
2. Installing the IK analyzer
- Installation
- Chinese test

1. Tokenization concepts

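A tokenizer receives a stream of characters, splits it into individual tokens (usually individual words), and outputs a stream of tokens.
Every analyzer, whether built-in or custom, is made of three kinds of building blocks: character filters, tokenizers, and token filters.
The built-in analyzers pre-package these building blocks into analyzers suited to different languages and types of text.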
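To make the three building blocks concrete, here is a minimal sketch that assembles a request body for Elasticsearch's `_analyze` API with each block named explicitly. The helper `build_analyze_request` and the sample text are illustrative assumptions, not part of these notes; `html_strip`, `whitespace`, and `lowercase` are built-in Elasticsearch components.

```python
import json


def build_analyze_request(text, char_filters, tokenizer, token_filters):
    """Build an `_analyze` request body from the three building blocks.

    Hypothetical helper for illustration only.
    """
    return {
        "char_filter": list(char_filters),  # zero or more, applied in order
        "tokenizer": tokenizer,             # exactly one
        "filter": list(token_filters),      # zero or more, applied in order
        "text": text,
    }


body = build_analyze_request(
    "<b>Quick</b> brown fox!",
    char_filters=["html_strip"],
    tokenizer="whitespace",
    token_filters=["lowercase"],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after `POST _analyze` in the Kibana console to try the combination without defining a custom analyzer in an index.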

Character filters

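A character filter receives the original text as a stream of characters and can transform that stream by adding, removing, or changing characters.
For example, a character filter could convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789).
An analyzer may have zero or more character filters, which are applied in order.
(PS: this is similar to filters or interceptors in Servlets; picture a filter chain.)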
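The filter-chain idea can be sketched in a few lines of Python. This is only a model of the concept, not how Elasticsearch implements it; the function names and the second filter (`strip_exclamations`) are made up for illustration.

```python
# Map Arabic-Indic digits (U+0660..U+0669) to ASCII digits 0..9.
ARABIC_INDIC_TO_LATIN = {0x0660 + i: str(i) for i in range(10)}


def digit_char_filter(text):
    """Replace Arabic-Indic digits with their Latin equivalents."""
    return text.translate(ARABIC_INDIC_TO_LATIN)


def strip_exclamations(text):
    """Hypothetical second filter: remove exclamation marks."""
    return text.replace("!", "")


def apply_char_filters(text, filters):
    """Run the text through each character filter in order."""
    for char_filter in filters:
        text = char_filter(text)
    return text


# "\u0663" is the Arabic-Indic digit three.
result = apply_char_filters("\u0663 foxes!", [digit_char_filter, strip_exclamations])
print(result)
```

Each filter sees the output of the previous one, which is why the order of the list matters.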

Tokenizer

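A tokenizer receives a stream of characters, splits it into individual tokens (usually individual words), and outputs a stream of tokens. For example, a whitespace tokenizer splits the text wherever it sees whitespace. It would convert the text "Quick brown fox!" into [Quick, brown, fox!].
(PS: the tokenizer is responsible for splitting text into individual tokens; here a token simply means a single word. A piece of text is cut into several parts, much like String.split in Java.)
The tokenizer is also responsible for recording the order or position of each term, and the start and end character offsets of the original word that the term represents. (PS: the output of tokenizing a text is an array of terms.)
An analyzer must have exactly one tokenizer.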
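As a rough model of what a whitespace tokenizer records, the sketch below splits on whitespace and keeps each term's position and character offsets. The `Token` type and `whitespace_tokenize` function are illustrative names, not Elasticsearch APIs.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    position: int      # order of the term in the stream
    start_offset: int  # index of the first character in the original text
    end_offset: int    # index one past the last character


def whitespace_tokenize(text):
    """Split on whitespace, recording position and offsets for each term."""
    return [
        Token(match.group(), position, match.start(), match.end())
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


tokens = whitespace_tokenize("Quick brown fox!")
for token in tokens:
    print(token)
```

The offsets follow the same convention as the `start_offset` and `end_offset` fields returned by `_analyze`: the end offset is exclusive.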

Token filters

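A token filter receives the token stream and may add, remove, or change tokens.
For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words such as "the", and a synonym token filter introduces synonyms into the token stream.
Token filters are not allowed to change the position or character offsets of each token.
An analyzer may have zero or more token filters, which are applied in order.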
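The three example filters can be modeled as plain functions over a list of terms. This is a simplified sketch: the stop-word list and synonym table are invented for the example, and positions and offsets are left out entirely, whereas a real synonym filter would emit the synonym at the same position as the original token.

```python
STOP_WORDS = {"the"}            # assumed stop-word list for this example
SYNONYMS = {"quick": ["fast"]}  # assumed synonym table for this example


def lowercase_filter(tokens):
    """Change tokens: convert every token to lowercase."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Remove tokens: drop common words."""
    return [token for token in tokens if token not in STOP_WORDS]


def synonym_filter(tokens):
    """Add tokens: emit each token followed by its synonyms, if any."""
    output = []
    for token in tokens:
        output.append(token)
        output.extend(SYNONYMS.get(token, []))
    return output


def apply_token_filters(tokens, filters):
    """Run the token list through each token filter in order."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


result = apply_token_filters(
    ["The", "Quick", "fox"],
    [lowercase_filter, stop_filter, synonym_filter],
)
print(result)
```

Order matters here too: if `stop_filter` ran before `lowercase_filter`, the capitalized "The" would not match the stop-word list and would survive.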

2. Installing the IK analyzer

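The analyzers bundled with Elasticsearch handle Chinese poorly (the default standard analyzer, for example, splits Chinese text into single characters), so we need at least one Chinese analyzer. Common choices include IK, jieba, and THULAC. These notes use IK, which is widely used with Elasticsearch. Next we install it in a Docker environment.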

Installation

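IK releases:
Pick the IK version that matches your installed ES version:
https://github.com/medcl/elasticsearch-analysis-ik/releases
The plugin has to end up in the plugins directory of the ES container. When we installed ES in Docker, we mapped the container's plugins directory to the host directory /mydata/elasticsearch/plugins/, so everything can be done on the host. You only need to work inside the container's plugins directory if you did not set up that mapping.
To enter the ES container: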

docker exec -it <container-id> /bin/bash

On the host:
Download the zip archive:

[root@localhost plugins]# cd /mydata/elasticsearch/plugins
[root@localhost plugins]# yum install -y wget
[root@localhost plugins]# wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
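Because the IK version has to match the ES version, the download link can be derived from the version string. The sketch below assumes that other releases follow the same URL pattern as the 7.4.2 link above, which should be verified on the releases page; `ik_download_url` is a hypothetical helper.

```python
# URL pattern inferred from the 7.4.2 download link; other versions are
# assumed to follow it.
IK_RELEASE_URL = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
    "v{version}/elasticsearch-analysis-ik-{version}.zip"
)


def ik_download_url(es_version):
    """Return the IK release URL for the given Elasticsearch version."""
    return IK_RELEASE_URL.format(version=es_version)


print(ik_download_url("7.4.2"))
```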

Unzip:

yum install -y unzip  # install unzip first if it is not already available
unzip elasticsearch-analysis-ik-7.4.2.zip

Delete the original zip archive:

[root@localhost plugins]# rm -rf *.zip

List the extracted IK files:

[root@localhost plugins]# ls -l
total 1432
-rw-r--r--. 1 root root 263965 May  6  2018 commons-codec-1.9.jar
-rw-r--r--. 1 root root  61829 May  6  2018 commons-logging-1.2.jar
drwxr-xr-x. 2 root root   4096 Oct  7  2019 config
-rw-r--r--. 1 root root  54643 Nov  4  2019 elasticsearch-analysis-ik-7.4.2.jar
-rw-r--r--. 1 root root 736658 May  6  2018 httpclient-4.5.2.jar
-rw-r--r--. 1 root root 326724 May  6  2018 httpcore-4.4.4.jar
-rw-r--r--. 1 root root   1805 Nov  4  2019 plugin-descriptor.properties
-rw-r--r--. 1 root root    125 Nov  4  2019 plugin-security.policy

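Copy the extracted files into a plugins/ik subdirectory (ES expects each plugin in its own directory under plugins):

cd /mydata/elasticsearch/plugins
mkdir ik
cp -r *.jar config plugin-descriptor.properties plugin-security.policy ik/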

Delete the old copies left directly under plugins:

[root@localhost plugins]# ls -l
total 1432
-rw-r--r--. 1 root root 263965 May  6  2018 commons-codec-1.9.jar
-rw-r--r--. 1 root root  61829 May  6  2018 commons-logging-1.2.jar
drwxr-xr-x. 2 root root   4096 Oct  7  2019 config
-rw-r--r--. 1 root root  54643 Nov  4  2019 elasticsearch-analysis-ik-7.4.2.jar
-rw-r--r--. 1 root root 736658 May  6  2018 httpclient-4.5.2.jar
-rw-r--r--. 1 root root 326724 May  6  2018 httpcore-4.4.4.jar
drwxr-xr-x. 3 root root    243 Aug 24 00:21 ik
-rw-r--r--. 1 root root   1805 Nov  4  2019 plugin-descriptor.properties
-rw-r--r--. 1 root root    125 Nov  4  2019 plugin-security.policy
[root@localhost plugins]# rm -rf com*
[root@localhost plugins]# rm -rf http*
[root@localhost plugins]# rm -rf plu*
[root@localhost plugins]# rm -rf elasticsearch-analysis-ik-7.4.2.jar
[root@localhost plugins]# ls -l
total 4
drwxr-xr-x. 2 root root 4096 Oct  7  2019 config
drwxr-xr-x. 3 root root  243 Aug 24 00:21 ik
[root@localhost plugins]# rm -rf con*
[root@localhost plugins]# ls -l
total 0
drwxr-xr-x. 3 root root 243 Aug 24 00:21 ik
[root@localhost plugins]#

Final layout of the ik directory under plugins:

-- plugins
    `-- ik
        |-- commons-codec-1.9.jar
        |-- commons-logging-1.2.jar
        |-- config
        |   |-- IKAnalyzer.cfg.xml
        |   |-- extra_main.dic
        |   |-- extra_single_word.dic
        |   |-- extra_single_word_full.dic
        |   |-- extra_single_word_low_freq.dic
        |   |-- extra_stopword.dic
        |   |-- main.dic
        |   |-- preposition.dic
        |   |-- quantifier.dic
        |   |-- stopword.dic
        |   |-- suffix.dic
        |   `-- surname.dic
        |-- elasticsearch-analysis-ik-7.4.2.jar
        |-- httpclient-4.5.2.jar
        |-- httpcore-4.4.4.jar
        |-- plugin-descriptor.properties
        `-- plugin-security.policy

Restart the ES container (b9c6 is the container ID prefix in this setup; use your own):

docker restart b9c6
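After the restart it is worth confirming that ES actually loaded the plugin. One way, sketched below, is to read the `_cat/plugins` endpoint and look for the IK component. The base URL `http://localhost:9200`, the component name `analysis-ik`, and the default column order (node name, component, version) are assumptions to check against your own cluster; both function names are hypothetical.

```python
import urllib.request


def fetch_cat_plugins(base_url="http://localhost:9200"):
    """Fetch the plain-text plugin list from a running ES node.

    base_url is an assumption; adjust host and port to your setup.
    """
    with urllib.request.urlopen(f"{base_url}/_cat/plugins") as response:
        return response.read().decode("utf-8")


def plugin_loaded(cat_plugins_text, component="analysis-ik"):
    """Return True if any line lists the given component in column two."""
    for line in cat_plugins_text.splitlines():
        columns = line.split()
        if len(columns) >= 2 and columns[1] == component:
            return True
    return False


# Hypothetical sample line; real output depends on your node name.
sample = "node-1 analysis-ik 7.4.2"
print(plugin_loaded(sample))
```

Against a live node, `plugin_loaded(fetch_cat_plugins())` performs the same check on real output.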

Chinese test

Example 1:

POST _analyze
{
  "analyzer": "ik_smart", 
  "text":"我是一名靓仔"
}

Result 1:

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "一名",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "靓仔",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

Example 2:

POST _analyze
{
  "analyzer": "ik_max_word", 
  "text":"我是一名靓仔"
}

Result 2:

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "一名",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "一",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "TYPE_CNUM",
      "position" : 3
    },
    {
      "token" : "名",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "COUNT",
      "position" : 4
    },
    {
      "token" : "靓仔",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}
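The two results differ in granularity: ik_smart returns a coarse, non-overlapping segmentation, while ik_max_word also emits the finer-grained pieces of "一名". The sketch below makes that visible using the (token, start_offset, end_offset) values copied from the two results above; the helper functions are illustrative.

```python
# (token, start_offset, end_offset) copied from the two results above.
ik_smart = [("我", 0, 1), ("是", 1, 2), ("一名", 2, 4), ("靓仔", 4, 6)]
ik_max_word = [
    ("我", 0, 1), ("是", 1, 2), ("一名", 2, 4),
    ("一", 2, 3), ("名", 3, 4), ("靓仔", 4, 6),
]


def extra_tokens(fine, coarse):
    """Tokens present in the fine-grained result but not in the coarse one."""
    coarse_set = set(coarse)
    return [token for token in fine if token not in coarse_set]


def has_overlaps(tokens):
    """True if any two tokens cover overlapping character ranges."""
    spans = sorted((start, end) for _, start, end in tokens)
    return any(
        spans[i + 1][0] < spans[i][1] for i in range(len(spans) - 1)
    )


print(extra_tokens(ik_max_word, ik_smart))
print(has_overlaps(ik_smart), has_overlaps(ik_max_word))
```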