1. Concepts

The suggest APIs fall into four main categories:

  1. Term Suggester (spelling correction: given a misspelled word, suggest the correct one)
  2. Phrase Suggester (phrase completion/correction: given a word, complete or correct the whole phrase)
  3. Completion Suggester (word completion: given the first part of a word, complete the whole word)
  4. Context Suggester (context-aware completion)

The overall effect is similar to Baidu's search box, as shown in the figure:
(Figure 1: search-box completion, from "ES Series 13: Elasticsearch Suggester API (auto-completion)")

2. Term Suggester (spelling correction)

2.1. API

1. Create the index

    PUT /book4
    {
      "mappings": {
        "english": {
          "properties": {
            "passage": {
              "type": "text"
            }
          }
        }
      }
    }

2.2. Insert data

    curl -H "Content-Type: application/json" -XPOST 'http://localhost:9200/_bulk' -d'
    { "index" : { "_index" : "book4", "_type" : "english" } }
    { "passage": "Lucene is cool"}
    { "index" : { "_index" : "book4", "_type" : "english" } }
    { "passage": "Elasticsearch builds on top of lucene"}
    { "index" : { "_index" : "book4", "_type" : "english" } }
    { "passage": "Elasticsearch rocks"}
    { "index" : { "_index" : "book4", "_type" : "english" } }
    { "passage": "Elastic is the company behind ELK stack"}
    { "index" : { "_index" : "book4", "_type" : "english" } }
    { "passage": "elk rocks"}
    { "index" : { "_index" : "book4", "_type" : "english" } }
    { "passage": "elasticsearch is rock solid"}
    '
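
The _bulk body above is newline-delimited JSON: each index action line is followed by the document source on the next line, and the body must end with a newline. A small Python sketch of assembling such a body (using the index and type names from this example):

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build an NDJSON _bulk request body: one action line per document,
    followed by the document source, with a mandatory trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = build_bulk_body("book4", "english", [
    {"passage": "Lucene is cool"},
    {"passage": "Elasticsearch builds on top of lucene"},
])
print(body)
```

The resulting string can be sent as-is as the request body of `POST /_bulk` with `Content-Type: application/json`.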

2.3. Check which tokens were stored

    POST /_analyze
    {
      "text": [
        "Lucene is cool",
        "Elasticsearch builds on top of lucene",
        "Elasticsearch rocks",
        "Elastic is the company behind ELK stack",
        "elk rocks",
        "elasticsearch is rock solid"
      ]
    }

Result:

    {
      "tokens": [
        { "token": "lucene",        "start_offset": 0,   "end_offset": 6,   "type": "<ALPHANUM>", "position": 0 },
        { "token": "is",            "start_offset": 7,   "end_offset": 9,   "type": "<ALPHANUM>", "position": 1 },
        { "token": "cool",          "start_offset": 10,  "end_offset": 14,  "type": "<ALPHANUM>", "position": 2 },
        { "token": "elasticsearch", "start_offset": 15,  "end_offset": 28,  "type": "<ALPHANUM>", "position": 103 },
        { "token": "builds",        "start_offset": 29,  "end_offset": 35,  "type": "<ALPHANUM>", "position": 104 },
        { "token": "on",            "start_offset": 36,  "end_offset": 38,  "type": "<ALPHANUM>", "position": 105 },
        { "token": "top",           "start_offset": 39,  "end_offset": 42,  "type": "<ALPHANUM>", "position": 106 },
        { "token": "of",            "start_offset": 43,  "end_offset": 45,  "type": "<ALPHANUM>", "position": 107 },
        { "token": "lucene",        "start_offset": 46,  "end_offset": 52,  "type": "<ALPHANUM>", "position": 108 },
        { "token": "elasticsearch", "start_offset": 53,  "end_offset": 66,  "type": "<ALPHANUM>", "position": 209 },
        { "token": "rocks",         "start_offset": 67,  "end_offset": 72,  "type": "<ALPHANUM>", "position": 210 },
        { "token": "elastic",       "start_offset": 73,  "end_offset": 80,  "type": "<ALPHANUM>", "position": 311 },
        { "token": "is",            "start_offset": 81,  "end_offset": 83,  "type": "<ALPHANUM>", "position": 312 },
        { "token": "the",           "start_offset": 84,  "end_offset": 87,  "type": "<ALPHANUM>", "position": 313 },
        { "token": "company",       "start_offset": 88,  "end_offset": 95,  "type": "<ALPHANUM>", "position": 314 },
        { "token": "behind",        "start_offset": 96,  "end_offset": 102, "type": "<ALPHANUM>", "position": 315 },
        { "token": "elk",           "start_offset": 103, "end_offset": 106, "type": "<ALPHANUM>", "position": 316 },
        { "token": "stack",         "start_offset": 107, "end_offset": 112, "type": "<ALPHANUM>", "position": 317 },
        { "token": "elk",           "start_offset": 113, "end_offset": 116, "type": "<ALPHANUM>", "position": 418 },
        { "token": "rocks",         "start_offset": 117, "end_offset": 122, "type": "<ALPHANUM>", "position": 419 },
        { "token": "elasticsearch", "start_offset": 123, "end_offset": 136, "type": "<ALPHANUM>", "position": 520 },
        { "token": "is",            "start_offset": 137, "end_offset": 139, "type": "<ALPHANUM>", "position": 521 },
        { "token": "rock",          "start_offset": 140, "end_offset": 144, "type": "<ALPHANUM>", "position": 522 },
        { "token": "solid",         "start_offset": 145, "end_offset": 150, "type": "<ALPHANUM>", "position": 523 }
      ]
    }

2.4. Term suggest API (searching a single field)

Try a search with the misspelled word Elasticsearaach:

    POST /book4/_search
    {
      "suggest": {
        "my-suggestion": {
          "text": "Elasticsearaach",
          "term": {
            "field": "passage",
            "suggest_mode": "popular"
          }
        }
      }
    }

Response:

    {
      "took": 26,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 0,
        "max_score": 0,
        "hits": []
      },
      "suggest": {
        "my-suggestion": [
          {
            "text": "elasticsearaach",
            "offset": 0,
            "length": 15,
            "options": [
              {
                "text": "elasticsearch",
                "score": 0.84615386,
                "freq": 3
              }
            ]
          }
        ]
      }
    }
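
The score in the response lines up with simple edit-distance arithmetic: "elasticsearaach" is two edits away from "elasticsearch", and 1 - 2/13 ≈ 0.84615386, i.e. one minus the edit distance divided by the shorter term's length. A quick back-of-the-envelope check in Python (an illustration of the idea, not necessarily Lucene's exact internal formula):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

typo, suggestion = "elasticsearaach", "elasticsearch"
dist = levenshtein(typo, suggestion)
score = 1 - dist / min(len(typo), len(suggestion))
print(dist, round(score, 8))  # 2 0.84615385 (the response shows the float-rounded 0.84615386)
```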

2.5. Search multiple fields, with separate suggestions for each:

    POST _search
    {
      "suggest": {
        "my-suggest-1": {
          "text": "tring out Elasticsearch",
          "term": {
            "field": "message"
          }
        },
        "my-suggest-2": {
          "text": "kmichy",
          "term": {
            "field": "user"
          }
        }
      }
    }


The term suggester suggests terms based on edit distance. The provided suggest text is analyzed before terms are suggested, and suggestions are produced per token of the analyzed suggest text. The term suggester does not take the query part of the request into account.

Common suggest options:

text: The suggest text. Required; must be set either globally or per suggestion.
field: The field to fetch candidate suggestions from. Required; must be set either globally or per suggestion.
analyzer: The analyzer used to analyze the suggest text. Defaults to the search analyzer of the suggest field.
size: The maximum number of corrections returned per suggest text token.
sort: Defines how suggestions are sorted per suggest text term. Two possible values:
- score: sort by similarity score first, then by document frequency, then by the term itself.
- frequency: sort by document frequency first, then by similarity score, then by the term itself.
suggest_mode: Controls which suggestions are included, or for which suggest text terms suggestions are generated. Three possible values:
- missing: only generate suggestions for suggest text terms that are not in the index. This is the default.
- popular: only suggest terms that occur in more documents than the original suggest text term.
- always: suggest any matching term, whether or not the suggest text term is in the index.

Other term suggest options:

lowercase_terms: Lowercase the suggest text terms after text analysis.
max_edits: The maximum edit distance a candidate suggestion can have in order to be considered a suggestion. Can only be a value between 1 and 2; any other value results in a bad request error. Defaults to 2.
prefix_length: The minimum number of prefix characters that must match for a term to be a candidate suggestion. Defaults to 1. Increasing this number improves spell-check performance; usually misspellings don't occur at the beginning of terms. (The old name "prefix_len" is deprecated.)
min_word_length: The minimum length a suggest text term must have in order to be included. Defaults to 4. (The old name "min_word_len" is deprecated.)
shard_size: Sets the maximum number of suggestions to retrieve from each individual shard. During the reduce phase only the top N suggestions are returned, based on the size option; shard_size defaults to the size option. Setting this to a value higher than size can be useful to get more accurate document frequencies for spelling corrections, at the cost of performance. Because terms are partitioned among shards, the shard-level document frequencies for spelling corrections may not be precise; increasing shard_size makes them more precise.
max_inspections: A factor that is multiplied with shard_size in order to inspect more candidate spelling corrections at the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
min_doc_freq: The minimal threshold for the number of documents a suggestion should appear in. Can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by suggesting only high-frequency terms. Defaults to 0f and is disabled by default. If a value higher than 1 is specified, it cannot be fractional. Shard-level document frequencies are used for this option.
max_term_freq: The maximum threshold for the number of documents a suggest text token can exist in, in order to be included. Can be a relative percentage (e.g. 0.4) or an absolute number representing document frequencies. If a value higher than 1 is specified, it cannot be fractional. Defaults to 0.01f. This can be used to exclude high-frequency terms from spell checking; such terms are usually spelled correctly, and excluding them also improves spell-check performance. Shard-level document frequencies are used for this option.
string_distance: The string distance implementation used to compare how similar suggested terms are. Five possible values:
- internal: the default, based on damerau_levenshtein but highly optimized for comparing string distances of terms inside the index.
- damerau_levenshtein: string distance based on the Damerau-Levenshtein algorithm.
- levenshtein: string distance based on the Levenshtein edit distance algorithm.
- jaro_winkler: string distance based on the Jaro-Winkler algorithm.
- ngram: string distance based on character n-grams.
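
To make the interplay of these options concrete, here is a toy in-memory sketch of candidate gating (not the Lucene implementation; the dictionary is just a Python list):

```python
from functools import lru_cache

def edit_distance(a, b):
    """Plain Levenshtein distance via memoized recursion."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0 or j == 0:
            return i + j
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))

def term_candidates(token, dictionary, *, max_edits=2, prefix_length=1,
                    min_word_length=4, suggest_mode="missing"):
    """Gate candidates the way the term suggester options do: minimum token
    length, a shared prefix, a bounded edit distance, and suggest_mode
    deciding whether an already-indexed token gets suggestions at all."""
    if len(token) < min_word_length:
        return []
    if suggest_mode == "missing" and token in dictionary:
        return []  # token already exists in the index dictionary
    return [term for term in dictionary
            if term != token
            and term[:prefix_length] == token[:prefix_length]
            and edit_distance(token, term) <= max_edits]

vocab = ["elastic", "elasticsearch", "lucene", "rock", "rocks", "solid"]
print(term_candidates("elasticsearaach", vocab))  # ['elasticsearch']
print(term_candidates("rocks", vocab))            # [] ("missing" mode: already indexed)
```

Switching `suggest_mode` to `"always"` makes the second call return `['rock']`, mirroring how the real option widens the set of terms that get suggestions.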

See the official API documentation for the full option list.

3. Phrase Suggester (phrase correction)

The phrase suggester builds on the term suggester: it also weighs the relationships between terms, such as whether they appear together in the indexed source text, how close to each other they are, and their term frequencies.

Example 1:

    POST book4/_search
    {
      "suggest": {
        "myss": {
          "text": "Elasticsearch rock",
          "phrase": {
            "field": "passage"
          }
        }
      }
    }

Response:

    {
      "took": 11,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 0,
        "max_score": 0,
        "hits": []
      },
      "suggest": {
        "myss": [
          {
            "text": "Elasticsearch rock",
            "offset": 0,
            "length": 18,
            "options": [
              {
                "text": "elasticsearch rocks",
                "score": 0.3467123
              }
            ]
          }
        ]
      }
    }
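
The re-ranking idea can be sketched with a toy model: generate per-token candidates by edit distance, then score whole candidate phrases by how often adjacent pairs co-occur in the index. This is a deliberate simplification (the real suggester uses a smoothed n-gram language model over a shingle field), and all names below are made up for the illustration:

```python
from itertools import product

def phrase_correct(phrase, vocab, bigrams):
    """For each token, collect the token itself plus vocabulary terms within
    edit distance 1, then pick the candidate phrase with the most bigram
    support in the (toy) index."""
    def edit1(a, b):
        # cheap edit-distance-==1 test: equal length with one substitution,
        # or lengths differ by one and the longer minus one char equals the shorter
        if abs(len(a) - len(b)) > 1:
            return False
        if len(a) == len(b):
            return sum(x != y for x, y in zip(a, b)) == 1
        s, l = (a, b) if len(a) < len(b) else (b, a)
        return any(l[:i] + l[i + 1:] == s for i in range(len(l)))

    def near(tok):
        return [tok] + [v for v in vocab if v != tok and edit1(tok, v)]

    tokens = phrase.lower().split()
    candidates = product(*(near(t) for t in tokens))
    def score(cand):
        return sum(pair in bigrams for pair in zip(cand, cand[1:]))
    return " ".join(max(candidates, key=score))

vocab = ["elasticsearch", "rocks", "rock", "solid", "lucene"]
bigrams = {("elasticsearch", "rocks"), ("rock", "solid"), ("is", "rock")}
print(phrase_correct("elasticsearch rock", vocab, bigrams))  # elasticsearch rocks
```

Note how "rock" alone is a valid indexed term, yet the phrase-level bigram evidence still prefers "elasticsearch rocks"; that is exactly what distinguishes the phrase suggester from per-term correction.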

4. Completion Suggester (auto-completion)

A suggester designed specifically for the auto-completion scenario. Here, every character the user types triggers an immediate query to the backend for matching entries, so when users type quickly the latency requirements on the backend are demanding. The implementation therefore uses a different data structure from the previous two suggesters: instead of searching the inverted index, the analyzed data is encoded into an FST that is stored alongside the index. For an index in the open state, ES loads the entire FST into memory, which makes prefix lookups extremely fast. But an FST can only serve prefix lookups, and that is the Completion Suggester's main limitation.
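
As an illustration of why a prefix structure makes this fast, here is a toy trie with weighted entries. Lucene's FST is far more compact (it shares suffixes and encodes weights into arcs), but the lookup follows the same idea: walk down the prefix, then enumerate the completions below it:

```python
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, phrase, weight):
        node = self.root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node["$"] = (phrase, weight)  # terminal marker holds the payload

    def complete(self, prefix, size=5):
        node = self.root
        for ch in prefix:              # walk down the prefix: O(len(prefix))
            if ch not in node:
                return []
            node = node[ch]
        found = []
        stack = [node]
        while stack:                   # collect every completion below the prefix
            n = stack.pop()
            if "$" in n:
                found.append(n["$"])
            stack.extend(v for k, v in n.items() if k != "$")
        found.sort(key=lambda t: -t[1])  # higher weight ranks first
        return [p for p, _ in found[:size]]

t = Trie()
for phrase, w in [("test", 34), ("book", 34), ("test english", 20), ("good", 3)]:
    t.insert(phrase, w)
print(t.complete("te"))  # ['test', 'test english']
```

The weights play the same role as the completion suggester's `weight` field below: they decide the order of completions that share a prefix.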

1. Create the index

    PUT /book5
    {
      "mappings": {
        "music": {
          "properties": {
            "suggest": {
              "type": "completion"
            },
            "title": {
              "type": "keyword"
            }
          }
        }
      }
    }

Insert data:


input specifies the input terms; weight specifies the ranking value (optional):

    PUT book5/music/5nupmmUBYLvVFwGWH3cu?refresh
    {
      "suggest": {
        "input": [ "test", "book" ],
        "weight": 34
      }
    }

Specify a different weight for each input:

    PUT book5/music/6Hu2mmUBYLvVFwGWxXef?refresh
    {
      "suggest": [
        {
          "input": "test",
          "weight": 10
        },
        {
          "input": "good",
          "weight": 3
        }
      ]
    }

Example 1: prefix-based suggestion query

    POST book5/_search?pretty
    {
      "suggest": {
        "song-suggest": {
          "prefix": "te",
          "completion": {
            "field": "suggest"
          }
        }
      }
    }

Response:

    {
      "took": 8,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 0,
        "max_score": 0,
        "hits": []
      },
      "suggest": {
        "song-suggest": [
          {
            "text": "te",
            "offset": 0,
            "length": 2,
            "options": [
              {
                "text": "test my book1",
                "_index": "book5",
                "_type": "music",
                "_id": "6Xu6mmUBYLvVFwGWpXeL",
                "_score": 1,
                "_source": {
                  "suggest": "test my book1"
                }
              },
              {
                "text": "test my book1",
                "_index": "book5",
                "_type": "music",
                "_id": "6nu8mmUBYLvVFwGWSndC",
                "_score": 1,
                "_source": {
                  "suggest": "test my book1"
                }
              },
              {
                "text": "test my book1 english",
                "_index": "book5",
                "_type": "music",
                "_id": "63u8mmUBYLvVFwGWZHdC",
                "_score": 1,
                "_source": {
                  "suggest": "test my book1 english"
                }
              }
            ]
          }
        ]
      }
    }

Example 2: deduplicate suggestion results

    POST book5/_search?pretty
    {
      "suggest": {
        "song-suggest": {
          "prefix": "te",
          "completion": {
            "field": "suggest",
            "skip_duplicates": true
          }
        }
      }
    }

Example 3: store phrases in the suggest document

    POST /book5/music/63u8mmUBYLvVFwGWZHdC?refresh
    {
      "suggest": {
        "input": [ "book1 english", "test english" ],
        "weight": 20
      }
    }

Query:

    POST book5/_search?pretty
    {
      "suggest": {
        "song-suggest": {
          "prefix": "test",
          "completion": {
            "field": "suggest",
            "skip_duplicates": true
          }
        }
      }
    }

Result:

    {
      "took": 7,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 0,
        "max_score": 0,
        "hits": []
      },
      "suggest": {
        "song-suggest": [
          {
            "text": "test",
            "offset": 0,
            "length": 4,
            "options": [
              {
                "text": "test english",
                "_index": "book5",
                "_type": "music",
                "_id": "63u8mmUBYLvVFwGWZHdC",
                "_score": 20,
                "_source": {
                  "suggest": {
                    "input": [
                      "book1 english",
                      "test english"
                    ],
                    "weight": 20
                  }
                }
              },
              {
                "text": "test my book1",
                "_index": "book5",
                "_type": "music",
                "_id": "6Xu6mmUBYLvVFwGWpXeL",
                "_score": 1,
                "_source": {
                  "suggest": "test my book1"
                }
              }
            ]
          }
        ]
      }
    }

5. Summary and recommendations

Using the Completion Suggester well is therefore not a simple task. In real-world development you need to combine analyzers and mapping parameters flexibly according to the characteristics of your data and your business requirements, and iterate repeatedly before you can achieve a satisfying completion experience.

Back to the completion/correction feature of the search box at the start of this article: how would you implement it with ES? One approach I can think of:

  1. While the user is still typing, use the Completion Suggester for keyword prefix matching. At first there will be many matches, and fewer as the user types more characters. If the input is fairly accurate, the Completion Suggester's results may already be good enough and the user sees the desired options.
  2. Once the Completion Suggester returns zero matches, the user may have made a typo; at that point, try the Phrase Suggester.
  3. If the Phrase Suggester finds no options either, fall back to the Term Suggester.

In terms of precision: Completion > Phrase > Term, while recall goes the other way. Performance-wise the Completion Suggester is the fastest; if it alone satisfies the business requirements, using only Completion Suggester prefix matching is ideal. Phrase and Term search the inverted index and are therefore considerably slower, so keep the amount of data in the indices the suggesters use under control; ideally, after some warm-up time the index can be fully mapped into memory.
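
The three-step strategy above can be sketched as a cascade; `query_completion`, `query_phrase`, and `query_term` are hypothetical callables wrapping the three suggest requests shown earlier:

```python
def suggest(text, query_completion, query_phrase, query_term):
    """Cascade: fast prefix completion first, then phrase correction,
    then per-term correction; stop at the first non-empty option list."""
    for source in (query_completion, query_phrase, query_term):
        options = source(text)
        if options:
            return options
    return []

# Stubbed example: completion finds no prefix match for a misspelled word,
# so the phrase suggester catches it; the term fallback is never reached.
result = suggest(
    "elasticsearch rock",
    lambda t: [],                          # completion: zero matches
    lambda t: ["elasticsearch rocks"],     # phrase suggester hit
    lambda t: ["elasticsearch", "rocks"],  # term fallback (not reached)
)
print(result)  # ['elasticsearch rocks']
```

In production each lambda would issue the corresponding _search request; the ordering encodes the precision/recall trade-off described above.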