Analyzers are used in two situations:
1. Index time analysis: when a document is created or updated, its text fields are analyzed (tokenized).
2. Search time analysis: when a query is executed, the query string is analyzed.
I. Ways to specify which analyzer is used at search time:
1. Specify the analyzer in the query with the analyzer parameter
GET test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "lin",
        "analyzer": "standard"
      }
    }
  }
}
2. Specify search_analyzer when creating the index mapping
# When creating an index, the standard analyzer is used by default if none is specified
PUT test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "whitespace",        # analyzer used at index (write) time; ES ships with several built-in analyzers
          "search_analyzer": "standard"    # analyzer used at search time
        }
      }
    }
  }
}
II. The analyze API
1. View analysis results
POST _analyze
{
  "analyzer": "standard",
  "text": "我是中国人"
}
2. View analysis results for a specific field of an index
POST /indexName/_analyze
{
  "field": "hobby",        # name of the field; the analyzer mapped to this field is used
  "text": "我是中国人"
}
3. Modify analyzer settings
# Enable the english stop-word token filter (the stop filter is disabled by default)
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {                  # name of the custom analyzer
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
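To check that the new analyzer behaves as intended, the index-scoped _analyze endpoint can be pointed at it. A minimal sketch, assuming the my_index / es_std settings above have been applied (the sample sentence is arbitrary):

POST /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
# expected tokens: dog, house ("a", "is", "in", "the" are removed by the english stop filter)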
III. Components of an ES analyzer
An analyzer consists of:
1. character filter: preprocesses the text before it is tokenized, most commonly by stripping HTML tags (e.g. <b>hello</b> -> hello) or mapping characters such as & -> and (I&you -> I and you)
2. tokenizer: splits the text into tokens according to a set of rules, e.g. hello you and me -> hello, you, and, me
3. token filter: adds (e.g. synonyms), modifies, or removes the tokens produced by the tokenizer; examples are lowercase, stop word, and synonym filters: dogs -> dog, liked -> like, Tom -> tom, a/the/an -> removed, mother -> mom, small -> little
Stop words (停用词) are words filtered out before indexing, e.g. 了, 的, 呢 in Chinese.
The analyzer matters: it runs a piece of text through all of these steps, and only the processed result is used to build the inverted index.
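The three stages can also be exercised ad hoc through the _analyze API, which accepts char_filter, tokenizer, and filter directly. A minimal sketch using only built-in components (the sample text is arbitrary):

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<b>The Dogs</b> liked Tom"
}
# expected tokens, roughly: dogs, liked, tom
# html_strip removes the tags, the standard tokenizer splits the words,
# lowercase lowercases them, and stop drops "the"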
Built-in analyzers (a quick comparison follows the list):
1. standard, composed of:
- tokenizer: Standard Tokenizer
- token filter: Standard Token Filter, Lower Case Token Filter, Stop Token Filter (disabled by default)
2. whitespace: splits on whitespace
3. simple: splits on non-letter characters and lowercases the tokens
4. stop: like simple, but removes stop words; stopwords defaults to english
5. keyword: does not tokenize; the whole input becomes a single token
6. language: language-specific analyzers, e.g. english
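A quick sketch contrasting a few of these on the same input (the sample sentence is arbitrary; outputs are approximate):

POST _analyze
{ "analyzer": "standard", "text": "The QUICK Brown-Foxes." }
# -> the, quick, brown, foxes

POST _analyze
{ "analyzer": "whitespace", "text": "The QUICK Brown-Foxes." }
# -> The, QUICK, Brown-Foxes.

POST _analyze
{ "analyzer": "keyword", "text": "The QUICK Brown-Foxes." }
# -> The QUICK Brown-Foxes.   (a single, unmodified token)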
IV. Third-party analyzers
ES ships with many built-in analyzers, but they do not handle Chinese well; for example, the standard analyzer splits a Chinese sentence into individual characters. In that case a third-party Analyzer plugin such as ik, jieba, or pinyin can be used.
To install one, download the plugin, unzip it into the plugins directory of ES, and restart ES for it to take effect.
For example, the ik analyzer behaves as follows:
GET _analyze
{
  "analyzer": "ik_max_word",    # analyzer name provided by the ik plugin
  "text": "你好吗?我有一句话要对你说呀。"
}

Result:
{
  "tokens": [
    {"token": "你好",   "start_offset": 0,  "end_offset": 2,  "type": "CN_WORD",   "position": 0},
    {"token": "好吗",   "start_offset": 1,  "end_offset": 3,  "type": "CN_WORD",   "position": 1},
    {"token": "我",     "start_offset": 4,  "end_offset": 5,  "type": "CN_CHAR",   "position": 2},
    {"token": "有",     "start_offset": 5,  "end_offset": 6,  "type": "CN_CHAR",   "position": 3},
    {"token": "一句话", "start_offset": 6,  "end_offset": 9,  "type": "CN_WORD",   "position": 4},
    {"token": "一句",   "start_offset": 6,  "end_offset": 8,  "type": "CN_WORD",   "position": 5},
    {"token": "一",     "start_offset": 6,  "end_offset": 7,  "type": "TYPE_CNUM", "position": 6},
    {"token": "句话",   "start_offset": 7,  "end_offset": 9,  "type": "CN_WORD",   "position": 7},
    {"token": "句",     "start_offset": 7,  "end_offset": 8,  "type": "COUNT",     "position": 8},
    {"token": "话",     "start_offset": 8,  "end_offset": 9,  "type": "CN_CHAR",   "position": 9},
    {"token": "要对",   "start_offset": 9,  "end_offset": 11, "type": "CN_WORD",   "position": 10},
    {"token": "你",     "start_offset": 11, "end_offset": 12, "type": "CN_CHAR",   "position": 11},
    {"token": "说呀",   "start_offset": 12, "end_offset": 14, "type": "CN_WORD",   "position": 12}
  ]
}
V. Custom analyzers
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {                       # step 1: character filters (preprocessing before tokenization)
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]            # map the & character to the word "and"
        }
      },
      "filter": {                            # step 3: token filters; this one removes "the" and "a"
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {                     # the custom analyzer that wires the pieces together
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],   # one built-in char filter plus the custom one
          "tokenizer": "standard",                     # step 2: the default tokenization strategy
          "filter": ["lowercase", "my_stopwords"]      # step 3: one built-in token filter plus the custom one
        }
      }
    }
  }
}
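To verify the whole pipeline, the custom analyzer can be run through the index-scoped _analyze endpoint. A minimal sketch, assuming the my_index settings above have been applied (the sample text is arbitrary):

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<b>Tom & Jerry</b> ate a snack"
}
# expected tokens, roughly: tom, and, jerry, ate, snack
# html_strip removes the <b> tags, &_to_and rewrites "&" as "and",
# the standard tokenizer splits the words, lowercase lowercases them,
# and my_stopwords drops "a" (it would also drop "the")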
