🌈ES - IK Analyzer - BigData

1) Analysis of the keyword type
2) Analysis of the text type
3) Testing the IK analyzer
- 1) Coarsest segmentation: ik_smart
- 2) Finest-grained segmentation: ik_max_word

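Analyzers matter most for Chinese text. Elasticsearch has two string field types, keyword and text.
A keyword field is not analyzed: the whole value is indexed as a single term.
A text field is analyzed, by default with the standard analyzer, which splits Chinese text into one token per character.
Neither behaviour suits Chinese full-text search in production, so a dedicated Chinese analyzer is needed; the IK analyzer is one of the most widely used.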

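1) Analysis of the keyword type

Note that _analyze has no keyword parameter, so the request below is rejected with a 400 error instead of returning tokens.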

GET _analyze
{
  "keyword":"我是程序员"
}

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Unknown parameter [keyword] in request body or parameter is of the wrong type[VALUE_STRING] "
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Unknown parameter [keyword] in request body or parameter is of the wrong type[VALUE_STRING] "
  },
  "status": 400
}
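
To inspect how a keyword value is actually treated, one option is to name the built-in keyword analyzer in the analyzer parameter of _analyze. The Python sketch below builds such a request body and models the behaviour one would expect from it, namely the whole input kept as a single token. The helper names build_analyze_body and keyword_tokens are made up for this illustration, and the modelled token carries only the offset and position fields, not everything a real response contains.

```python
import json


def build_analyze_body(text, analyzer=None):
    """Build a request body for GET _analyze (hypothetical helper).

    When analyzer is None the key is omitted and Elasticsearch falls
    back to its default analyzer.
    """
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    return body


def keyword_tokens(text):
    """Simplified model of the keyword analyzer: one token, the whole input."""
    return [{
        "token": text,
        "start_offset": 0,
        "end_offset": len(text),
        "position": 0,
    }]


body = build_analyze_body("我是程序员", analyzer="keyword")
print(json.dumps(body, ensure_ascii=False))
print(keyword_tokens("我是程序员"))
```

The printed body can be pasted under GET _analyze in the console in place of the rejected request.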

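2) Analysis of the text type

When no analyzer is specified, _analyze uses the standard analyzer, the same default that text fields use.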

GET _analyze
{
  "text":"我是程序员"
}

{
  "tokens": [
    {
      "token": "我",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "程",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "序",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "员",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    }
  ]
}
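
The per-character result above can be mimicked with a few lines of Python. This is only a simplified model, not the standard analyzer itself: it handles characters in the basic CJK Unified Ideographs block and silently skips everything else, and the function names is_han and split_ideographs are invented for this sketch.

```python
def is_han(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def split_ideographs(text):
    """Emit one token per Han character, with offsets and positions."""
    tokens = []
    for offset, ch in enumerate(text):
        if is_han(ch):
            tokens.append({
                "token": ch,
                "start_offset": offset,
                "end_offset": offset + 1,
                "type": "<IDEOGRAPHIC>",
                "position": len(tokens),  # positions count emitted tokens
            })
    return tokens


for tok in split_ideographs("我是程序员"):
    print(tok)
```

Because every character becomes its own term, a search for 程序员 can only be matched as the three separate characters 程, 序 and 员, which is why a dictionary-based analyzer is preferable for Chinese.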

3) Testing the IK analyzer

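IK provides two analyzers, ik_smart and ik_max_word.
ik_smart produces the coarsest segmentation (the fewest tokens),
while ik_max_word produces the finest-grained segmentation, which can include overlapping words.
The analyzer is selected through the analyzer parameter of _analyze.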

1) Coarsest segmentation: ik_smart

GET _analyze
{
  "analyzer": "ik_smart",
  "text":"我是程序员"
}

{
    "tokens" : [
        {
            "token" : "我",
            "start_offset" : 0,
            "end_offset" : 1,
            "type" : "CN_CHAR",
            "position" : 0
        },
        {
            "token" : "是",
            "start_offset" : 1,
            "end_offset" : 2,
            "type" : "CN_CHAR",
            "position" : 1
        },
        {
            "token" : "程序员",
            "start_offset" : 2,
            "end_offset" : 5,
            "type" : "CN_WORD",
            "position" : 2
        }
    ]
}
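
To give a feel for how a dictionary-driven analyzer can arrive at a coarse result like the one above, here is a toy forward maximum matching segmenter in Python. It is not IK's actual algorithm, and TOY_DICT is a tiny hypothetical dictionary chosen for this example; the name coarse_segment is likewise made up.

```python
# Hypothetical mini dictionary; a real analyzer ships a far larger one.
TOY_DICT = {"程序", "程序员"}


def coarse_segment(text, dictionary=TOY_DICT):
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        # Try the longest candidate first, down to length 2.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append({
            "token": match,
            "start_offset": i,
            "end_offset": i + len(match),
            "type": "CN_WORD" if len(match) > 1 else "CN_CHAR",
            "position": len(tokens),
        })
        i += len(match)
    return tokens


for tok in coarse_segment("我是程序员"):
    print(tok)
```

With this toy dictionary the sketch yields the same three tokens as the ik_smart response, because 程序员 is preferred over the shorter 程序 at offset 2.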

2) Finest-grained segmentation: ik_max_word

GET _analyze
{
  "analyzer": "ik_max_word",
  "text":"我是程序员"
}

{
    "tokens" : [
        {
            "token" : "我",
            "start_offset" : 0,
            "end_offset" : 1,
            "type" : "CN_CHAR",
            "position" : 0
        },
        {
            "token" : "是",
            "start_offset" : 1,
            "end_offset" : 2,
            "type" : "CN_CHAR",
            "position" : 1
        },
        {
            "token" : "程序员",
            "start_offset" : 2,
            "end_offset" : 5,
            "type" : "CN_WORD",
            "position" : 2
        },
        {
            "token" : "程序",
            "start_offset" : 2,
            "end_offset" : 4,
            "type" : "CN_WORD",
            "position" : 3
        },
        {
            "token" : "员",
            "start_offset" : 4,
            "end_offset" : 5,
            "type" : "CN_CHAR",
            "position" : 4
        }
    ]
}
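
The fine-grained result differs from the coarse one in that overlapping words (程序员, 程序, 员) are all kept. A toy way to picture this is exhaustive dictionary matching: at every start offset, emit every dictionary entry that begins there. The Python sketch below does exactly that. It is an illustration only and not IK's implementation; TOY_DICT, including its single-character entries, is a hypothetical dictionary picked so that the toy reproduces the token list shown above, and fine_segment is an invented name.

```python
# Hypothetical mini dictionary, including single-character entries.
TOY_DICT = {"我", "是", "程序", "程序员", "员"}


def fine_segment(text, dictionary=TOY_DICT):
    """Exhaustive matching: emit every dictionary entry found at every
    start offset, longest first, so overlapping words are all kept."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    for start in range(len(text)):
        for size in range(min(max_len, len(text) - start), 0, -1):
            piece = text[start:start + size]
            if piece in dictionary:
                tokens.append({
                    "token": piece,
                    "start_offset": start,
                    "end_offset": start + size,
                    "type": "CN_WORD" if size > 1 else "CN_CHAR",
                    "position": len(tokens),
                })
    return tokens


for tok in fine_segment("我是程序员"):
    print(tok)
```

Keeping the overlapping tokens means a query for either 程序 or 程序员 can match the same document, at the cost of a larger index.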