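Overview
The ngram tokenizer works very well for searching Chinese personal names, drug names, and similar terms. It uses no dictionary: it mechanically splits the text into overlapping runs of a few consecutive characters. The example below configures one.
First, add the custom tokenizer my_tokenizer to the index my-index: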
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}
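Here my_tokenizer takes two length parameters: min_gram is the minimum number of characters in a token, and max_gram is the maximum. The token_chars setting lists the character classes kept inside tokens (here letters and digits); any other character acts as a split point.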
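To make the effect of min_gram and max_gram concrete, here is a minimal Python sketch of the n-gram enumeration. It is only an illustration, not Elasticsearch's implementation: the function name ngrams is hypothetical, and the sketch assumes the input contains only letters and digits, so the splitting done by token_chars is left out.

```python
def ngrams(text, min_gram=2, max_gram=3):
    """Return (token, start_offset, end_offset) for every substring
    whose length is between min_gram and max_gram, ordered by start offset.

    Assumption: text contains only letters/digits, so no token_chars splitting.
    """
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # longer grams from this start would also run past the end
            tokens.append((text[start:end], start, end))
    return tokens


if __name__ == "__main__":
    for token in ngrams("梁朝伟"):
        print(token)
```

For a three-character input such as "梁朝伟", the sketch yields the two 2-character grams and the single 3-character gram, each with its start and end offsets.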
Chinese personal names
Next, let's test it:
POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "梁朝伟"
}
The result is:
{
  "tokens" : [
    {"token" : "梁朝", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0},
    {"token" : "梁朝伟", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 1},
    {"token" : "朝伟", "start_offset" : 1, "end_offset" : 3, "type" : "word", "position" : 2}
  ]
}
Drug names
POST my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "阿莫西林克拉维酸钾"
}
The result is:
{
  "tokens" : [
    {"token" : "阿莫", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0},
    {"token" : "阿莫西", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 1},
    {"token" : "莫西", "start_offset" : 1, "end_offset" : 3, "type" : "word", "position" : 2},
    {"token" : "莫西林", "start_offset" : 1, "end_offset" : 4, "type" : "word", "position" : 3},
    {"token" : "西林", "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 4},
    {"token" : "西林克", "start_offset" : 2, "end_offset" : 5, "type" : "word", "position" : 5},
    {"token" : "林克", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 6},
    {"token" : "林克拉", "start_offset" : 3, "end_offset" : 6, "type" : "word", "position" : 7},
    {"token" : "克拉", "start_offset" : 4, "end_offset" : 6, "type" : "word", "position" : 8},
    {"token" : "克拉维", "start_offset" : 4, "end_offset" : 7, "type" : "word", "position" : 9},
    {"token" : "拉维", "start_offset" : 5, "end_offset" : 7, "type" : "word", "position" : 10},
    {"token" : "拉维酸", "start_offset" : 5, "end_offset" : 8, "type" : "word", "position" : 11},
    {"token" : "维酸", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 12},
    {"token" : "维酸钾", "start_offset" : 6, "end_offset" : 9, "type" : "word", "position" : 13},
    {"token" : "酸钾", "start_offset" : 7, "end_offset" : 9, "type" : "word", "position" : 14}
  ]
}
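Because every 2- and 3-character fragment of the text is indexed as a token, searching becomes very simple: a query containing any such fragment of a name matches it.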
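The matching behaviour can be pictured with a small Python sketch. This is a deliberately simplified model, not how Elasticsearch evaluates queries: it assumes the searched field is analyzed with my_analyzer at both index and search time (a mapping the steps above do not show), it treats a match as "at least one shared gram", and it ignores relevance scoring. The helper names grams and matches are hypothetical.

```python
def grams(text, min_gram=2, max_gram=3):
    """Set of all substrings of text with length min_gram..max_gram."""
    return {
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    }


def matches(query, indexed_text):
    """Simplified model: the query matches if it shares at least one gram
    with the indexed text (assumes the same analyzer on both sides)."""
    return bool(grams(query) & grams(indexed_text))


if __name__ == "__main__":
    print(matches("西林", "阿莫西林克拉维酸钾"))  # fragment from the middle of the drug name
    print(matches("朝伟", "梁朝伟"))              # given name only
    print(matches("头孢", "阿莫西林克拉维酸钾"))  # unrelated drug prefix
```

In this model a single-character query produces no grams at all when min_gram is 2, so it matches nothing; lowering min_gram to 1 would change that at the cost of many more tokens.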
