ElasticSearch - Analyzer Workflow - 《搜索引擎》 (Search Engine)

Analyzer workflow
Built-in analyzers
Customizing an analyzer
Converting special strings

Analyzer workflow

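1. Split the text into terms, using the analyzer you specified.
2. Normalization: make the terms more standardized and regular. The split terms may include filler words (modal particles such as 呢, 的, 啊) that are not needed, so ElasticSearch has to drop them. Case conversion and synonym handling are also done during normalization.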

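Given a sentence, the analyzer splits it into individual words and normalizes each word (tense conversion, singular/plural conversion).
Recall: normalization increases the number of results a search is able to find.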
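The effect of normalization on recall can be sketched in a few lines of Python. This is a toy illustration, not how ElasticSearch is implemented: the `NORMALIZE` table, the sample `docs`, and the `search` helper are all hypothetical names made up for this example.

```python
# Hypothetical normalization table; a real analyzer would apply stemming rules instead.
NORMALIZE = {"liked": "like", "likes": "like", "dogs": "dog"}

def tokens_raw(text):
    """Split on whitespace only, with no normalization."""
    return text.split()

def tokens_normalized(text):
    """Lowercase each token and map it to its base form when one is known."""
    return [NORMALIZE.get(t.lower(), t.lower()) for t in text.split()]

docs = ["Tom liked dogs", "I like my dog", "cats sleep"]

def search(query, analyze):
    """Return the positions of documents that share at least one term with the query."""
    terms = set(analyze(query))
    return [i for i, d in enumerate(docs) if terms & set(analyze(d))]

print(search("like", tokens_raw))         # [1]
print(search("like", tokens_normalized))  # [0, 1]
```

Without normalization only the document containing the exact word "like" matches; with it, "liked" is reduced to the same term and the first document is found as well.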

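How an analyzer processes text:

1. character filter: preprocesses the text before it is tokenized. The most common cases are stripping HTML tags (hello wrapped in tags -> hello) and mapping & -> and (I&you -> I and you)
2. tokenizer: splits the text into tokens, hello you and me -> hello, you, and, me
3. token filter: lowercase, stop words, synonyms, for example liked -> like, Tom -> tom, a/the/an -> removed, small -> little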
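The three stages above can be sketched as a small Python pipeline. It is a rough approximation under stated assumptions: the regular expressions only imitate what the real character filters and tokenizer do, and `BASE_FORMS` and `SYNONYMS` are hypothetical lookup tables standing in for stemming and synonym filters.

```python
import re

def char_filter(text):
    """Stage 1: strip HTML tags and rewrite '&' before tokenizing."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")

def tokenizer(text):
    """Stage 2: split the text into word tokens."""
    return re.findall(r"\w+", text)

STOP_WORDS = {"a", "the", "an"}
BASE_FORMS = {"liked": "like"}   # hypothetical stemming table
SYNONYMS = {"small": "little"}   # hypothetical synonym table

def token_filter(tokens):
    """Stage 3: lowercase, drop stop words, reduce to base form, apply synonyms."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        tok = BASE_FORMS.get(tok, tok)
        tok = SYNONYMS.get(tok, tok)
        out.append(tok)
    return out

def analyze(text):
    """Run the three stages in order."""
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<b>Tom</b> liked a small dog & the cat"))
# ['tom', 'like', 'little', 'dog', 'and', 'cat']
```

The order matters: character filters see the raw string, the tokenizer sees the cleaned string, and token filters only ever see individual tokens.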

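The analyzer matters a great deal: it puts the text through all of this processing, and only the final processed terms are used to build the inverted index.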
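As a minimal sketch of that last step, the snippet below builds an inverted index from analyzed terms. The `analyze` function here is a deliberately trivial stand-in (lowercase plus whitespace split), and `build_inverted_index` is a hypothetical helper, not an ElasticSearch API.

```python
from collections import defaultdict

def analyze(text):
    """Stand-in analyzer: lowercase the text and split on whitespace."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {1: "Hello you and me", 2: "hello World"}
index = build_inverted_index(docs)

print(sorted(index["hello"]))  # [1, 2]
print(sorted(index["world"]))  # [2]
```

Because both documents are lowercased before indexing, "Hello" and "hello" end up under the same term.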

Built-in analyzers

Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)

simple analyzer：set, the, shape, to, semi, transparent, by, calling, set, trans

whitespace analyzer：Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

stop analyzer: removes stop words such as a, the, it
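The simple and whitespace results listed above can be approximated in Python. This is only a sketch: the regular expression handles ASCII letters, whereas the real simple analyzer works on Unicode letters, and the standard analyzer is left out because its word-boundary rules are not captured by a one-line pattern.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def simple_analyzer(text):
    """Approximates simple: break on anything that is not a letter, then lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def whitespace_analyzer(text):
    """Approximates whitespace: break on whitespace only, leave tokens untouched."""
    return text.split()

print(simple_analyzer(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
print(whitespace_analyzer(TEXT))
# ['Set', 'the', 'shape', 'to', 'semi-transparent', 'by', 'calling', 'set_trans(5)']
```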

Test:
POST _analyze
{
  "analyzer": "standard",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}

Customizing an analyzer

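1) The default analyzer
standard
standard tokenizer: splits on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words such as a, the, it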

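2) Changing an analyzer's settings (for reference)
Enable the english stop word token filter

The request below defines an analyzer named es_std; the settings inside its braces customize the standard analyzer it is based on.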

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

// "analyzer": "standard" means: use the standard analyzer (stop words are kept)
GET /my_index/_analyze
{
  "analyzer": "standard", 
  "text": "a dog is in the house"
}
// "analyzer": "es_std" means: use the es_std analyzer (english stop words are removed)
GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text":"a dog is in the house"
}
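The difference between the two requests can be sketched in Python. This is an approximation only: `standard_like` is a hypothetical helper that uses a `\w+` pattern in place of the real standard tokenizer, and `ENGLISH_STOP_SUBSET` is a small hand-picked subset of the `_english_` list, not the full list.

```python
import re

# Assumed subset of the _english_ stop word list, enough for this sentence.
ENGLISH_STOP_SUBSET = {"a", "an", "and", "in", "is", "it", "the", "to"}

def standard_like(text, stopwords=()):
    """Tokenize on word characters, lowercase, then drop any configured stop words."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    return [t for t in tokens if t not in stopwords]

text = "a dog is in the house"

print(standard_like(text))                       # like "standard": stop filter disabled
# ['a', 'dog', 'is', 'in', 'the', 'house']
print(standard_like(text, ENGLISH_STOP_SUBSET))  # like "es_std": stop words removed
# ['dog', 'house']
```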

Converting special strings

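// my_index must not exist yet; delete the index created in the previous section before running this
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {   // a name of your own choosing, like an alias
                    "type": "mapping",  // a mapping (replacement) character filter
                    "mappings": [   // an array; separate multiple rules with commas
                        "&=> and"  // replace & with and
                    ]
                }
            },
            "filter": {  // stop word handling
                "my_stopwords": {   // a name of your own choosing
                    "type": "stop",
                    "stopwords": [  // the stop words to remove
                        "the",
                        "a"
                    ]
                }
            },
            "analyzer": {
                "my_analyzer": { // name of the custom analyzer
                    "type": "custom", // marks this as a custom analyzer; the value "custom" is fixed, not a free-form name
                    "char_filter": [  // character filters, applied in order
                        "html_strip", // built into ES: strips HTML tags from the content
                        "&_to_and"   // the character filter declared above
                    ],
                    "tokenizer": "standard", // build on the standard tokenizer
                    "filter": [  // token filters, applied in order
                        "lowercase",  // convert to lowercase
                        "my_stopwords" // the token filter declared above
                    ]
                }
            }
        }
    }
}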
// Test the effect of the analyzer defined above
GET /my_index/_analyze
{
    "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
    "analyzer": "my_analyzer"
}
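// Use my_analyzer for the content field. Mapping types such as my_type exist only in older versions; from ElasticSearch 7.x on, use PUT /my_index/_mapping instead
PUT /my_index/_mapping/my_type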
{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "my_analyzer"
        }
    }
}
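To make the wiring of the custom analyzer easier to follow, here is a rough Python sketch that resolves the named components in the same order. Everything in it is an assumption for illustration: `SETTINGS` is a hand-written Python mirror of the request body, `run_analyzer` is a hypothetical helper, the regular expressions only imitate html_strip and the standard tokenizer, and the mapping is written as "&" to " and " with surrounding spaces, which may not match how ElasticSearch treats whitespace in the rule "&=> and". The printed result is the output of this sketch, not verified ElasticSearch output.

```python
import re

# Hand-written mirror of the analysis settings (assumption: '&' becomes ' and ').
SETTINGS = {
    "char_filter": {"&_to_and": {"type": "mapping", "mappings": {"&": " and "}}},
    "filter": {"my_stopwords": {"type": "stop", "stopwords": ["the", "a"]}},
    "analyzer": {
        "my_analyzer": {
            "char_filter": ["html_strip", "&_to_and"],
            "tokenizer": "standard",
            "filter": ["lowercase", "my_stopwords"],
        }
    },
}

def run_analyzer(name, text, settings=SETTINGS):
    """Apply character filters, then tokenize, then apply token filters, in listed order."""
    spec = settings["analyzer"][name]
    for cf in spec["char_filter"]:
        if cf == "html_strip":
            text = re.sub(r"<[^>]+>", "", text)       # crude tag removal
        else:
            for src, dst in settings["char_filter"][cf]["mappings"].items():
                text = text.replace(src, dst)
    tokens = re.findall(r"\w+", text)                  # rough stand-in for the standard tokenizer
    for f in spec["filter"]:
        if f == "lowercase":
            tokens = [t.lower() for t in tokens]
        else:
            stop = set(settings["filter"][f]["stopwords"])
            tokens = [t for t in tokens if t not in stop]
    return tokens

print(run_analyzer("my_analyzer", "tom&jerry are a friend in the house, <a>, HAHA!!"))
# ['tom', 'and', 'jerry', 'are', 'friend', 'in', 'house', 'haha']
```

Running the real `_analyze` request against my_index is the way to confirm the actual tokens.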