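Analyzer choice matters most for Chinese text. Elasticsearch has two string field types, keyword and text.
A keyword field is not analyzed: the whole value is indexed as a single term.
A text field uses the standard analyzer by default, which splits Chinese text into individual characters, each becoming its own term.
Neither behavior suits Chinese full-text search in production, so a dedicated Chinese analyzer is needed. The IK analyzer is one of the most widely used.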
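1) Analysis for the keyword type
The sample text 我是程序员 means "I am a programmer". Note that keyword is a field type, not a parameter of the _analyze API, so the request below is rejected with a 400 error:
GET _analyze
{
  "keyword": "我是程序员"
}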
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Unknown parameter [keyword] in request body or parameter is of the wrong type[VALUE_STRING] "
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Unknown parameter [keyword] in request body or parameter is of the wrong type[VALUE_STRING] "
  },
  "status": 400
}
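To observe the no-tokenization behavior described for keyword, one option is to run the text through Elasticsearch's built-in keyword analyzer, which emits the entire input as a single token. The request below is a minimal sketch of that approach and is not part of the original walkthrough:

```json
# Assumption: the built-in "keyword" analyzer stands in for the keyword field type
GET _analyze
{
  "analyzer": "keyword",
  "text": "我是程序员"
}
```

The response should contain exactly one token, 我是程序员, covering the whole input.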
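2) Analysis for the text type
When no analyzer is specified, _analyze uses the standard analyzer, which is also the default for text fields. It emits every Chinese character as a separate token:
GET _analyze
{
  "text": "我是程序员"
}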
{
  "tokens": [
    {
      "token": "我",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "程",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "序",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "员",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    }
  ]
}
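3) Testing the IK analyzer
IK provides two segmentation modes, ik_smart and ik_max_word.
ik_smart performs the coarsest-grained segmentation, producing the fewest tokens.
ik_max_word performs the finest-grained segmentation, emitting every word it can find.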
1) Coarsest-grained segmentation: ik_smart
The analyzer parameter specifies which analyzer to use.
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我是程序员"
}
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "程序员",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
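2) Finest-grained segmentation: ik_max_word
The analyzer parameter specifies which analyzer to use. The output includes overlapping tokens: 程序员, 程序, and 员.
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是程序员"
}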
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "程序员",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "程序",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "员",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 4
    }
  ]
}
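To use IK on real documents rather than in ad hoc _analyze calls, the analyzer is set on a text field in the index mapping. One common pattern, sketched here under stated assumptions, indexes with ik_max_word for broader matching and searches with ik_smart for more precise queries. The index name my_index and the field name content are hypothetical, and the typeless mapping syntax assumes Elasticsearch 7 or later with the IK plugin installed.

```json
# Hypothetical index and field names; assumes ES 7+ with the IK plugin installed
PUT /my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```

With this mapping, a match query on content would analyze the query string with ik_smart and compare it against the terms that ik_max_word produced at index time.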