Analyzers matter mainly for Chinese text. In ES, string fields come in two types: keyword and text. A keyword field is not analyzed at all, while a text field (with the default standard analyzer) breaks Chinese into individual characters, each becoming a separate term. Neither behavior is suitable for production, so we need a dedicated analyzer to do this work; the IK analyzer is the most widely used one.
1) Tokenization of the keyword type
GET _analyze{"keyword":"我是程序员"}
{"error": {"root_cause": [{"type": "illegal_argument_exception","reason": "Unknown parameter [keyword] in request body or parameter is of the wrong type[VALUE_STRING] "}],"type": "illegal_argument_exception","reason": "Unknown parameter [keyword] in request body or parameter is of the wrong type[VALUE_STRING] "},"status": 400}
2) Tokenization of the text type
GET _analyze{"text":"我是程序员"}
{"tokens": [{"token": "我","start_offset": 0,"end_offset": 1,"type": "<IDEOGRAPHIC>","position": 0},{"token": "是","start_offset": 1,"end_offset": 2,"type": "<IDEOGRAPHIC>","position": 1},{"token": "程","start_offset": 2,"end_offset": 3,"type": "<IDEOGRAPHIC>","position": 2},{"token": "序","start_offset": 3,"end_offset": 4,"type": "<IDEOGRAPHIC>","position": 3},{"token": "员","start_offset": 4,"end_offset": 5,"type": "<IDEOGRAPHIC>","position": 4}]}
3) Testing the IK analyzer
IK provides two tokenization algorithms, ik_smart and ik_max_word: ik_smart produces the coarsest (fewest) splits, while ik_max_word produces the finest-grained splits.
1) Coarsest split: ik_smart
The analyzer field specifies which analyzer to use.
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我是程序员"
}
{"tokens" : [{"token" : "我","start_offset" : 0,"end_offset" : 1,"type" : "CN_CHAR","position" : 0},{"token" : "是","start_offset" : 1,"end_offset" : 2,"type" : "CN_CHAR","position" : 1},{"token" : "程序员","start_offset" : 2,"end_offset" : 5,"type" : "CN_WORD","position" : 2}]}
2) Finest-grained split: ik_max_word
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是程序员"
}
{"tokens" : [{"token" : "我","start_offset" : 0,"end_offset" : 1,"type" : "CN_CHAR","position" : 0},{"token" : "是","start_offset" : 1,"end_offset" : 2,"type" : "CN_CHAR","position" : 1},{"token" : "程序员","start_offset" : 2,"end_offset" : 5,"type" : "CN_WORD","position" : 2},{"token" : "程序","start_offset" : 2,"end_offset" : 4,"type" : "CN_WORD","position" : 3},{"token" : "员","start_offset" : 4,"end_offset" : 5,"type" : "CN_CHAR","position" : 4}]}
