How it is implemented

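Keyword completion is built on Elasticsearch's completion suggester. The suggester is backed by a prefix-tree-like structure (a finite state transducer) that is loaded entirely into memory, so lookups are extremely fast.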
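To make the prefix-lookup idea concrete, here is a minimal sketch of a weighted prefix tree. It is a simplification for illustration only: Elasticsearch uses a more compact structure internally, and the class and method names below (`SuggestTrie`, `add`, `suggest`) are made up for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    # Set only on nodes where a complete keyword ends.
    weight: Optional[int] = None


class SuggestTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, keyword, weight):
        """Insert a keyword, one character per tree level."""
        node = self.root
        for ch in keyword:
            node = node.children.setdefault(ch, TrieNode())
        node.weight = weight

    def suggest(self, prefix, size=5):
        """Return up to `size` keywords starting with `prefix`, heaviest first."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(prefix, node)]
        while stack:
            text, current = stack.pop()
            if current.weight is not None:
                found.append((current.weight, text))
            for ch, child in current.children.items():
                stack.append((text + ch, child))
        # Highest weight first; ties broken alphabetically.
        found.sort(key=lambda pair: (-pair[0], pair[1]))
        return [text for _, text in found[:size]]


trie = SuggestTrie()
trie.add("apple", 3)
trie.add("app", 10)
trie.add("apply", 5)
trie.add("banana", 8)
print(trie.suggest("app"))  # ['app', 'apply', 'apple']
```

The lookup cost depends on the prefix length and the size of the matching subtree, not on the total number of keywords, which is why this kind of structure suits type-ahead queries.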

Available data

For each keyword, per day: PV, UV, search count, and no-result count.

Autocomplete implementation

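pv/uv: how many valid items each user viewed (coefficient 2)
pv/search_cnt: how many valid items each search produced (coefficient 1)
search_cnt/uv: how many times each user searched (coefficient 0.05)
no_results_pv/search_cnt: how many invalid images each search produced (coefficient -0.5, negative feedback)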

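· Blend the last seven days of data (PV, UV, search_cnt) with the coefficients above to compute a weight
· Store the result in ES
· Filter out sensitive words and fill in the pinyin fields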
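The weighting step can be sketched in a few lines. This is only an illustration of the arithmetic using the four coefficients listed above; the function name `keyword_weight`, the zero-division guard, and the sample numbers are assumptions, not taken from the production job.

```python
def keyword_weight(pv, uv, search_cnt, no_results_pv):
    """Blend the four ratios with coefficients 2, 1, 0.05 and -0.5."""
    if uv <= 0 or search_cnt <= 0:
        # Assumption: a keyword with no users or no searches gets weight 0.
        return 0.0
    return (
        pv / uv * 2                           # valid items per user
        + pv / search_cnt * 1                 # valid items per search
        + search_cnt / uv * 0.05              # searches per user
        - no_results_pv / search_cnt * 0.5    # negative feedback
    )


# Made-up seven-day totals for one keyword.
weight = keyword_weight(pv=1200, uv=300, search_cnt=400, no_results_pv=40)
print(round(weight, 2))  # 11.02
```

With these numbers the pv/uv term dominates (8 out of roughly 11), which matches the intent of giving it the largest coefficient.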

Create the index template

    PUT /_template/keyword_suggest
    {
      "order": 1,
      "index_patterns": [
        "*_keyword_suggest"
      ],
      "settings": {
        "analysis": {
          "analyzer": {
            "prefix_pinyin_analyzer": {
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "prefix_pinyin"
              ]
            },
            "full_pinyin_analyzer": {
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "full_pinyin"
              ]
            }
          },
          "filter": {
            "_pattern": {
              "type": "pattern_capture",
              "preserve_original": true,
              "patterns": [
                "([0-9])",
                "([a-z])"
              ]
            },
            "prefix_pinyin": {
              "type": "pinyin",
              "keep_first_letter": true,
              "keep_full_pinyin": false,
              "none_chinese_pinyin_tokenize": false,
              "keep_original": false
            },
            "full_pinyin": {
              "type": "pinyin",
              "keep_first_letter": false,
              "keep_full_pinyin": true,
              "keep_original": false,
              "keep_none_chinese_in_first_letter": false
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "id": {
            "type": "keyword"
          },
          "suggestText": {
            "type": "completion",
            "analyzer": "standard",
            "preserve_separators": false,
            "preserve_position_increments": true,
            "max_input_length": 50
          },
          "prefix_pinyin": {
            "type": "completion",
            "analyzer": "prefix_pinyin_analyzer",
            "search_analyzer": "standard",
            "preserve_separators": false
          },
          "full_pinyin": {
            "type": "completion",
            "analyzer": "full_pinyin_analyzer",
            "search_analyzer": "full_pinyin_analyzer",
            "preserve_separators": false
          }
        }
      },
      "aliases": {
        "keyword_suggest": {}
      }
    }

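Thanks to the pinyin analyzer, a query is still matched when it contains wrong characters (homophones), as long as its full pinyin is correct.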
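A small sketch shows why homophone typos still match. The character table below is hand-written for this example and covers only five characters; a real setup would rely on the pinyin plugin or a pinyin library instead.

```python
# Tiny hand-written pinyin table, for illustration only.
PINYIN = {"刘": "liu", "留": "liu", "德": "de", "华": "hua", "花": "hua"}


def full_pinyin(text):
    """Join the full pinyin of every character; unknown characters pass through lowercased."""
    return "".join(PINYIN.get(ch, ch.lower()) for ch in text)


def first_letters(text):
    """Join the first pinyin letter of every character."""
    return "".join(PINYIN.get(ch, ch.lower())[0] for ch in text)


correct = "刘德华"
typo = "留德花"  # wrong characters, same pronunciation

print(full_pinyin(correct))    # liudehua
print(full_pinyin(typo))       # liudehua
print(first_letters(correct))  # ldh
```

Because both spellings collapse to the same pinyin string, a prefix lookup on the full-pinyin field finds the correctly spelled keyword even when the user typed the wrong characters.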

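Spark computation

The Spark job relies on too many helper functions, so the full code is not included here.
The idea is to load all the data and then compute a weight column as follows: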

    import org.apache.spark.sql.functions.col

    // df holds the seven-day totals per keyword: pv, uv, search_cnt, no_results_pv
    val weighted = df.withColumn(
      "weight",
      (col("pv") / col("uv") * 2 +
        col("pv") / col("search_cnt") +
        col("search_cnt") / col("uv") * 0.05 -
        col("no_results_pv") / col("search_cnt") * 0.5) * 3.5 // final scaling factor
    )

Once the weight column is added, select keyword and weight and write them directly to ES.
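For illustration, here is one way the (keyword, weight) rows could be turned into a Bulk API request body. The document shape follows the fields that the cleaning job in the next section reads (`id`, `suggestText.input`, `suggestText.weight`); the MD5-based id, the index name, and the function name `to_bulk_lines` are assumptions made for this sketch. As far as I know the completion field expects a non-negative integer weight, so the sketch rounds and clamps the computed value.

```python
import hashlib
import json


def to_bulk_lines(rows, index_name):
    """Build an NDJSON bulk body: one action line plus one document line per keyword."""
    lines = []
    for keyword, weight in rows:
        # Assumption: derive a stable document id from the keyword itself.
        doc_id = hashlib.md5(keyword.encode("utf-8")).hexdigest()
        action = {"index": {"_index": index_name, "_id": doc_id}}
        doc = {
            "id": doc_id,
            "suggestText": {
                "input": [keyword],
                # Round to an integer and never go below zero.
                "weight": max(0, round(weight)),
            },
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical index name matching the "*_keyword_suggest" template pattern.
body = to_bulk_lines([("风景", 11.02), ("abc", -2.4)], "20240101_keyword_suggest")
print(body)
```

Only `suggestText` is written at this stage; the two pinyin fields are filled in afterwards by the cleaning job.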

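Cleaning keywords and filling in pinyin

This runs as a Laravel scheduled task and uses overtrue's pinyin library (https://github.com/overtrue/pinyin) to generate the pinyin.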

    public function handle()
    {
        // In-memory dictionary loader
        $pinyin = new Pinyin('Overtrue\Pinyin\MemoryFileDictLoader');
        $scroll_id = null;
        // Replace the "day" part of the index name with yesterday's date
        KeywordSuggest::$search_index = str_replace('day', Carbon::yesterday()->format('Ymd'), KeywordSuggest::$search_index);

        // Scroll through the index 1000 documents at a time
        foreach (KeywordSuggest::scrollIndex([], 1000, null, null, $scroll_id) as $value)
        {
            list($suggests, $scroll_id) = $value;
            $params = ['body' => []];

            foreach ($suggests as $item)
            {
                $suggest = $item['_source'];
                $id = array_get($suggest, 'id');
                $suggest_text = array_get($suggest, 'suggestText.input.0');
                $weight = array_get($suggest, 'suggestText.weight');

                // Delete keywords that contain sensitive words
                if ($suggest_text && Video::getBadWord($suggest_text))
                {
                    KeywordSuggest::deleteIndex($id, false);
                    continue;
                }
                // Delete keywords with anything other than Chinese characters, letters, or digits
                if (!preg_match('/^[\x{4e00}-\x{9fa5}A-Za-z0-9]+$/u', (string) $suggest_text))
                {
                    KeywordSuggest::deleteIndex($id, false);
                    continue;
                }

                $full_pinyin = implode('', $pinyin->convert($suggest_text));
                $prefix_pinyin = $pinyin->abbr($suggest_text);

                // Bulk partial update: the action line, then the fields to merge in
                $params['body'][] = [
                    'update' => [
                        '_index' => KeywordSuggest::$search_index,
                        '_type' => '_doc',
                        '_id' => $id
                    ]
                ];
                $params['body'][] = [
                    'doc' => [
                        'full_pinyin' => [
                            'input' => [$full_pinyin],
                            'weight' => $weight
                        ],
                        'prefix_pinyin' => [
                            'input' => [$prefix_pinyin],
                            'weight' => $weight
                        ],
                    ]
                ];
            }

            // Skip the request when every document in this batch was deleted
            if (!empty($params['body']))
            {
                KeywordSuggest::elasticSearchclient()->bulk($params);
            }
            unset($params);
            $this->info($scroll_id);
        }
        $this->info('success');
    }
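The two deletion rules in the job above can be isolated and checked on their own. The sketch below mirrors them under stated assumptions: `BAD_WORDS` is a hypothetical stand-in for whatever `Video::getBadWord` checks, and `should_delete` is a name invented for this example.

```python
import re

# Same character classes as the PHP regex: CJK ideographs, ASCII letters, digits.
ALLOWED = re.compile(r"[\u4e00-\u9fa5A-Za-z0-9]+")

# Hypothetical sensitive-word list, standing in for Video::getBadWord.
BAD_WORDS = {"badword"}


def should_delete(text):
    """Return True when a keyword should be removed from the suggest index."""
    if not text:
        return True
    if any(word in text for word in BAD_WORDS):
        return True
    # Reject spaces, punctuation, emoji and anything else outside the allowed set.
    return ALLOWED.fullmatch(text) is None


for sample in ["风景图", "abc123", "hello world", "图片!", "xbadwordx"]:
    print(sample, should_delete(sample))
```

Note that the allowed-character rule is strict: a keyword containing a space or punctuation is dropped entirely rather than normalized.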

Usage

The query DSL is as follows:

    // Run the same prefix against all three completion fields
    $body['suggest'] = [
        'prefix_pinyin' => [
            'prefix' => $keyword,
            'completion' => [
                'field' => 'prefix_pinyin',
                'skip_duplicates' => true
            ]
        ],
        'full_pinyin' => [
            'prefix' => $keyword,
            'completion' => [
                'field' => 'full_pinyin',
                'skip_duplicates' => true
            ]
        ],
        'suggestText' => [
            'prefix' => $keyword,
            'completion' => [
                'field' => 'suggestText',
                'skip_duplicates' => true
            ]
        ],
    ];
    // Only return the original keyword field
    $body['_source']['includes'] = 'suggestText';
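Since the query runs three suggesters, the same keyword can come back more than once, and `skip_duplicates` only deduplicates within a single suggester. Below is a sketch of merging the three result lists on the client side. The response layout (`suggest` → suggester name → entries → `options` with `_score` and `_source`) follows the standard suggest response; the function name `merge_suggestions` and the sample data are made up. The keyword is read from `_source.suggestText.input` because, for the pinyin fields, the option text is the pinyin string rather than the original keyword.

```python
def merge_suggestions(response, size=10):
    """Merge options from all suggesters, keeping the best score per keyword."""
    best = {}
    for entries in response.get("suggest", {}).values():
        for entry in entries:
            for option in entry.get("options", []):
                source = option.get("_source", {})
                inputs = source.get("suggestText", {}).get("input", [])
                if not inputs:
                    continue
                keyword = inputs[0]
                score = option.get("_score", 0.0)
                if score > best.get(keyword, float("-inf")):
                    best[keyword] = score
    # Highest score first; ties broken by keyword for a stable order.
    ranked = sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))
    return [keyword for keyword, _ in ranked[:size]]


# Made-up response for the prefix "l".
sample = {
    "suggest": {
        "prefix_pinyin": [{"text": "l", "options": [
            {"text": "ldh", "_score": 12.0,
             "_source": {"suggestText": {"input": ["刘德华"], "weight": 12}}},
        ]}],
        "full_pinyin": [{"text": "l", "options": [
            {"text": "liudehua", "_score": 12.0,
             "_source": {"suggestText": {"input": ["刘德华"], "weight": 12}}},
        ]}],
        "suggestText": [{"text": "l", "options": [
            {"text": "lol", "_score": 7.0,
             "_source": {"suggestText": {"input": ["lol"], "weight": 7}}},
        ]}],
    }
}
print(merge_suggestions(sample))  # ['刘德华', 'lol']
```

Keeping the best score per keyword is one possible policy; summing scores across suggesters would instead favor keywords that match on several fields.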

Results

[Image: screenshot of the autocomplete results]