5. Script

Elasticsearch supports three scripting languages: Groovy (1.4.x – 5.0), Painless, and Expression.
Groovy
Dropped after Elasticsearch 5.0 because of security problems.
Painless
Painless is a simple, secure scripting language designed specifically for Elasticsearch. It looks a lot like Java, with comments, keywords, types, variables and functions, and it is the default scripting language, safe to use for both inline and stored scripts.
Expression
Very low per-document overhead: expressions execute extremely fast, even faster than a native script. They support a subset of JavaScript syntax: a single expression. Drawbacks: they can only access numeric, boolean, date and geo_point fields, and stored fields are not available.

Script template:

```json
{METHOD} {_index}/{action}/{_id}
{
  "script": {
    "lang": "painless | expression",
    "source": "ctx._source.{field} {operator} {value}"
  }
}
```

Use a Painless script to decrement the premium field by 1 on the doc with _index "order" and _id "1":

```json
POST order/_update/1
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.premium -= 1"
  }
}
```

Use a Painless script to append the string "66666" to the tenderAddress field of the doc with _index "order" and _id "1":

```json
POST order/_update/1
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.tenderAddress += '66666'"
  }
}
```

Use a Painless script to delete the doc with _index "order" and _id "1":

```json
POST order/_update/1
{
  "script": {
    "lang": "painless",
    "source": "ctx.op = 'delete'"
  }
}
```

upsert is short for update + insert: if the doc already exists the request performs an update, and if it does not exist the upsert document is inserted. In the doc with _index "order" and _id "1", the tags field is an array.

```json
POST order/_update/1
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.tags.add(params.zhangdeshi)",
    "params": {
      "zhangdeshi": "真TM帅",
      "xiangeshi": "真JB差"
    }
  },
  "upsert": {
    "age": "30",
    "name": "李明",
    "tags": []
  }
}
```

Use an expression script to return the premium field:

```json
GET {_index}/_search
{
  "script_fields": {
    "{name_for_the_result_field}": {
      "script": {
        "lang": "expression",
        "source": "doc['premium']"
      }
    }
  }
}
```

Use an expression script to return premium with a 10% discount applied:

```json
GET {_index}/_search
{
  "script_fields": {
    "{name_for_the_result_field}": {
      "script": {
        "lang": "expression",
        "source": "doc['premium'].value*0.9"
      }
    }
  }
}
```

Use Painless with parameters to return premium at a 10% and a 20% discount:

```json
GET order/_search
{
  "script_fields": {
    "{name_for_the_result_field}": {
      "script": {
        "lang": "painless",
        "source": "[doc['premium'].value*params.da9zhe, doc['premium'].value*params.da8zhe]",
        "params": {
          "da9zhe": 0.9,
          "da8zhe": 0.8
        }
      }
    }
  }
}
```

Stored script

  1. First, store a script (the params are not stored; they are supplied when the script is used):

```json
POST _scripts/{script-id}
{
  "script": {
    "lang": "painless",
    "source": "[doc['premium'].value*params.da9zhe, doc['premium'].value*params.da8zhe]"
  }
}
```
  2. Use the stored script:

```json
POST order/_search
{
  "script_fields": {
    "{name_for_the_result_field}": {
      "script": {
        "id": "{script-id}",
        "params": {
          "da9zhe": 0.009,
          "da8zhe": 0.008
        }
      }
    }
  }
}
```

Custom scripts work a bit like stored procedures: you can put logic inside them, such as if and for statements or custom calculations (a small if/for sketch follows the example below).

```json
POST order/_update/2
{
  "script": {
    "lang": "painless",
    "source": """
      ctx._source.premium *= ctx._source.rate;
      ctx._source.amount += params.danwei
    """,
    "params": {
      "danwei": "万元"
    }
  }
}
```
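For instance, a minimal sketch of branching logic in an update script. The threshold value and the 0.9/0.95 factors are made up for illustration; it only assumes the order doc has a numeric premium field as in the earlier examples:

```json
POST order/_update/2
{
  "script": {
    "lang": "painless",
    "source": """
      if (ctx._source.premium > params.threshold) {
        ctx._source.premium *= 0.9;      // bigger discount above the threshold
      } else {
        ctx._source.premium *= 0.95;     // smaller discount otherwise
      }
    """,
    "params": {
      "threshold": 100
    }
  }
}
```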

Find the documents whose premium is between 10 and 205 (inclusive) and, across the matching documents, sum the number of characters in the insuranceCompany field.

```json
GET order/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "premium": {
              "gte": 10,
              "lte": 205
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "{name_for_the_agg_result}": {
      "sum": {
        "script": {
          "lang": "painless",
          "source": """
            int total = 0;
            for (int i = 0; i < doc['insuranceCompany.keyword'].length; i++) {
              total += doc['insuranceCompany.keyword'][i].length();
            }
            return total;
          """
        }
      }
    }
  },
  "size": 0
}
```

How writes work

(diagram: image.png)

6. Analyzers

An analyzer is made up of three key components (a combined example follows this list):

  1. char_filter

    Pre-processes the character stream before tokenization: strips useless characters, tags and so on. For example: 《Elasticsearch》 => Elasticsearch.

  2. tokenizer

    Does the actual tokenization.

  3. token_filter

    Handles stop words, tense normalization, case conversion, synonyms, particles and so on. For example:
    has => have
    him => he
    apples => apple
    the/oh/a => dropped
6.1 Configuring a char_filter

1. Create an index named "my-index" and configure a character filter inside its analyzer:

```json
# Create an index named "my-index" and specify a char filter inside the analyzer
PUT my-index
{
  "settings": {
    "analysis": {
      // the char filter, named "my-charfilter": the listed characters are mapped to other content
      "char_filter": {
        "my-charfilter": {                 // name of the char filter
          "type": "mapping",               // filter type
          "mappings": ["&=>and", "&&=>and", "|=>or", "||=>or"]   // character mappings
        }
      },
      // point the analyzer at the char filter defined above
      "analyzer": {
        "my-analyzer": {                   // name of the analyzer
          "tokenizer": "keyword",
          "char_filter": ["my-charfilter"] // char filters to use
        }
      }
    }
  }
}
```

Test the my-charfilter char filter of my-index:

```json
GET my-index/_analyze
{
  "analyzer": "my-analyzer",
  "text": ["666&77&&8||"]
}
```

Result:

```json
{
  "tokens" : [
    {
      "token" : "666and77and8or",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    }
  ]
}
```

2. Create an index named "my-index2" with an html_strip character filter and escaped (preserved) tags:

```json
# Create an index named my-index2 and specify a char filter
PUT my-index2
{
  "settings": {
    "analysis": {
      // char filter named "my-charfilter", type "html_strip", escaped tags ["a","p"]
      "char_filter": {
        "my-charfilter": {               // name of the char filter
          "type": "html_strip",          // filter type
          "escaped_tags": ["a", "p"]     // tags the filter must leave untouched
        }
      },
      // analyzer named my-analyzer, using the char filter defined above
      "analyzer": {
        "my-analyzer": {                 // name of the analyzer
          "tokenizer": "keyword",
          "char_filter": ["my-charfilter"]  // char filters to use
        }
      }
    }
  }
}

# Test
GET my-index2/_analyze
{
  "analyzer": "my-analyzer",
  "text": ["<a><b>hello word</b></a>", "<p>hello word</p>"]
}
```

Result: the `<b>` tags are stripped while the escaped `<a>` and `<p>` tags are kept, so the two keyword tokens are `<a>hello word</a>` and `<p>hello word</p>`.
3. Match with a regular expression and replace the matched characters:

```json
PUT my-index3
{
  "settings": {
    "analysis": {
      // the char filter
      "char_filter": {
        "my-charfilter": {                 // name of the char filter
          "type": "pattern_replace",       // filter type
          "pattern": "(\\d+)-(?=\\d)",     // the regex to match
          "replacement": "$1_"             // replacement for each match
        }
      },
      // analyzer named my-analyzer, using the char filter defined above
      "analyzer": {
        "my-analyzer": {                   // name of the analyzer
          "tokenizer": "standard",
          "char_filter": ["my-charfilter"] // char filters to use
        }
      }
    }
  }
}
```

Test:

```json
GET my-index3/_analyze
{
  "analyzer": "my-analyzer",
  "text": "My credit card is 123-456-789"
}
```

Result:

```json
{
  "tokens" : [
    { "token" : "My",          "start_offset" : 0,  "end_offset" : 2,  "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "credit",      "start_offset" : 3,  "end_offset" : 9,  "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "card",        "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "is",          "start_offset" : 15, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "123_456_789", "start_offset" : 18, "end_offset" : 29, "type" : "<ALPHANUM>", "position" : 4 }
  ]
}
```

6.2 Configuring a token filter

With nothing custom:

```json
# Test:
GET _analyze
{
  "tokenizer" : "standard",        // standard tokenizer
  "filter" : ["lowercase"],        // lowercase token filter
  "text" : "THE Quick FoX JUMPs"   // test input
}

# Result:
{
  "tokens" : [
    { "token" : "the",   "start_offset" : 0,  "end_offset" : 3,  "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "quick", "start_offset" : 4,  "end_offset" : 9,  "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "fox",   "start_offset" : 10, "end_offset" : 13, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "jumps", "start_offset" : 14, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 3 }
  ]
}
```

With a condition script:

```json
# Test
GET /_analyze
{
  "tokenizer" : "standard",            // standard tokenizer
  "filter": [
    {
      "type": "condition",             // conditional token filter
      "filter": [ "lowercase" ],       // apply lowercase...
      "script": {                      // ...only to tokens that satisfy this condition
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"        // test input
}

# Result:
{
  "tokens" : [
    { "token" : "the",   "start_offset" : 0,  "end_offset" : 3,  "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "QUICK", "start_offset" : 4,  "end_offset" : 9,  "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "BROWN", "start_offset" : 10, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "fox",   "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 3 }
  ]
}
```

6.3 Configuring an analyzer

Stop words:

```json
PUT /my_index8
{
  "settings": {
    "analysis": {
      // the analyzer
      "analyzer": {
        "my_analyzer": {             // named "my_analyzer"
          "type": "standard",        // based on the standard analyzer
          "stopwords": "_english_"   // use the built-in English stop-word list
        }
      }
    }
  }
}

# Test
GET my_index8/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}

# Result:
{
  "tokens" : [
    { "token" : "teacher",  "start_offset" : 0,  "end_offset" : 7,  "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "ma",       "start_offset" : 8,  "end_offset" : 10, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "restroom", "start_offset" : 21, "end_offset" : 29, "type" : "<ALPHANUM>", "position" : 5 }
  ]
}
```

6.4 Configuring analysis

```json
PUT my-index9
{
  "settings": {
    "analysis": {
      // character filters
      "char_filter": {
        "test_char_filter": {          // char filter named "test_char_filter"
          "type": "mapping",           // type mapping
          "mappings": [ "& => and", "| => or" ]
        }
      },
      // token filters
      "filter": {
        "test_stopwords": {            // token filter named "test_stopwords"
          "type": "stop",              // type stop
          "stopwords": ["is","in","at","the","a","for"]
        }
      },
      // tokenizers
      "tokenizer": {
        "punctuation": {
          "type": "pattern",           // pattern (regex) tokenizer
          "pattern": "[ .,!?]"         // the regex
        }
      },
      "analyzer": {
        "my_analyzer": {               // analyzer named my_analyzer
          "type": "custom",            // type custom tells Elasticsearch we are defining a custom analyzer
          "char_filter": [             // char filters: our test_char_filter plus the built-in html_strip
            "test_char_filter",
            "html_strip"
          ],
          "tokenizer": "standard",     // tokenizer: standard
          "filter": ["lowercase","test_stopwords"]  // token filters: built-in lowercase plus our test_stopwords
        }
      }
    }
  }
}

# Test
GET my-index9/_analyze
{
  "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
  "analyzer": "my_analyzer"
}

# Result
{
  "tokens" : [
    { "token" : "teacher",  "start_offset" : 0,  "end_offset" : 7,  "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "ma",       "start_offset" : 8,  "end_offset" : 10, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "and",      "start_offset" : 11, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "zhang",    "start_offset" : 13, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "also",     "start_offset" : 19, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "thinks",   "start_offset" : 24, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "mother's", "start_offset" : 32, "end_offset" : 40, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "friends",  "start_offset" : 41, "end_offset" : 48, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "good",     "start_offset" : 53, "end_offset" : 57, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "or",       "start_offset" : 58, "end_offset" : 59, "type" : "<ALPHANUM>", "position" : 10 },
    { "token" : "nice",     "start_offset" : 60, "end_offset" : 64, "type" : "<ALPHANUM>", "position" : 11 }
  ]
}
```

6.5 A complete mapping + analysis

```json
PUT my_index12
{
  "settings": {
    "analysis": {
      // character filters
      "char_filter": {
        "test_char_filter": {          // char filter named "test_char_filter"
          "type": "mapping",           // type mapping
          "mappings": [ "& => and", "| => or" ]
        }
      },
      // token filters
      "filter": {
        "test_stopwords": {            // token filter named "test_stopwords"
          "type": "stop",              // type stop
          "stopwords": ["is","in","at","the","a","for"]
        }
      },
      // tokenizers
      "tokenizer": {
        "punctuation": {
          "type": "pattern",           // pattern (regex) tokenizer
          "pattern": "[ .,!?]"         // the regex
        }
      },
      "analyzer": {
        "my_analyzer": {               // analyzer named my_analyzer
          "type": "custom",            // type custom tells Elasticsearch we are defining a custom analyzer
          "char_filter": [             // char filters: our test_char_filter plus the built-in html_strip
            "test_char_filter",
            "html_strip"
          ],
          "tokenizer": "standard",     // tokenizer: standard
          "filter": ["lowercase","test_stopwords"]  // token filters: built-in lowercase plus our test_stopwords
        }
      }
    }
  },
  "mappings": {
    "properties": {
      // the "name" field
      "name": {
        "type": "text",                    // text type
        "analyzer": "my_analyzer",         // analyzer used at index time
        "search_analyzer": "standard"      // analyzer used at search time
      },
      // the "said" field
      "said": {
        "type": "text",                    // text type
        "analyzer": "my_analyzer",         // analyzer used at index time
        "search_analyzer": "standard"      // analyzer used at search time
      }
    }
  }
}
```

6.6 The ik Chinese analyzer

1. Installing and using ik

ik is a Java project built with Maven: clone it and run the Maven package command, and a zip file named elasticsearch-analysis-ik-7.4.0.zip appears under target/release/.
Unpack elasticsearch-analysis-ik-7.4.0.zip and copy all of the extracted files into `{ES_source_path}/plugins/ik` (you need to create the ik folder under plugins first).
Then restart Elasticsearch.

Test the ik Chinese analyzer:

```json
# Test the ik Chinese analyzer
GET _analyze
{
  "analyzer": "ik_max_word",                 // one of ik's two built-in analyzers
  "text": [ "美国留给伊拉克的是个烂摊子吗" ]   // test input
}

# Result
{
  "tokens" : [
    { "token" : "美国",   "start_offset" : 0,  "end_offset" : 2,  "type" : "CN_WORD", "position" : 0 },
    { "token" : "留给",   "start_offset" : 2,  "end_offset" : 4,  "type" : "CN_WORD", "position" : 1 },
    { "token" : "伊拉克", "start_offset" : 4,  "end_offset" : 7,  "type" : "CN_WORD", "position" : 2 },
    { "token" : "的",     "start_offset" : 7,  "end_offset" : 8,  "type" : "CN_CHAR", "position" : 3 },
    { "token" : "是",     "start_offset" : 8,  "end_offset" : 9,  "type" : "CN_CHAR", "position" : 4 },
    { "token" : "个",     "start_offset" : 9,  "end_offset" : 10, "type" : "CN_CHAR", "position" : 5 },
    { "token" : "烂摊子", "start_offset" : 10, "end_offset" : 13, "type" : "CN_WORD", "position" : 6 },
    { "token" : "摊子",   "start_offset" : 11, "end_offset" : 13, "type" : "CN_WORD", "position" : 7 },
    { "token" : "吗",     "start_offset" : 13, "end_offset" : 14, "type" : "CN_CHAR", "position" : 8 }
  ]
}
```

Problems you may run into during installation:

java.lang.IllegalArgumentException: Plugin [analysis-ik] was built for Elasticsearch version 7.10.0 but version 7.10.1 is running

This simply means the ik plugin and Elasticsearch versions do not match, so the Elasticsearch version recorded inside the ik plugin has to be changed to the version that is actually running.
Edit the plugin-descriptor.properties file: it has an "elasticsearch.version" entry; set it to your Elasticsearch version.
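A sketch of the relevant line (assuming Elasticsearch 7.10.1 is the running version, as in the error above):

```
# plugin-descriptor.properties — only the line that matters here
elasticsearch.version=7.10.1
```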

2. ik's two analyzers

ik_max_word: fine-grained; this is the one you normally use.
ik_smart: coarse-grained (a quick comparison sketch follows).
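To compare the two, a minimal sketch that runs the same sentence as above through ik_smart (assuming the ik plugin is installed); it should return fewer, coarser tokens than ik_max_word:

```json
GET _analyze
{
  "analyzer": "ik_smart",
  "text": [ "美国留给伊拉克的是个烂摊子吗" ]
}
```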

3. Files under config

  • IKAnalyzer.cfg.xml: the IK configuration file
  • main.dic: the main dictionary
  • stopword.dic: English stop words; they are not put into the inverted index
  • quantifier.dic: special dictionary: units of measure and quantifiers
  • suffix.dic: special dictionary: suffixes
  • surname.dic: special dictionary: Chinese surnames
  • preposition.dic: special dictionary: particles and function words

4. Hot-reloading the ik dictionaries

Option 1: modify the IK source code so that it periodically connects to a database, queries the main words and stop words, and loads them into the local IK dictionaries.

Option 2: poll an HTTP endpoint on a schedule.
The plugin supports hot-reloading the IK dictionaries through the following entries in the IKAnalyzer.cfg.xml configuration file; a sketch of them is shown below.
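A sketch of what those entries typically look like in IKAnalyzer.cfg.xml; the URLs are placeholders for your own endpoints:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- remote extension dictionary (the words_location discussed below) -->
    <entry key="remote_ext_dict">http://your-host/getCustomDict</entry>
    <!-- remote extension stop-word dictionary -->
    <entry key="remote_ext_stopwords">http://your-host/getStopWords</entry>
</properties>
```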

Here words_location is a URL; a request to it only has to satisfy the following two points for hot dictionary updates to work.

  1. The HTTP response must return two headers: **Last-Modified** and **ETag**, both strings. Whenever either of them changes, the plugin fetches the word list again and updates its dictionary.
  2. The response body must contain **one word per line**, with **\n** as the line separator.

If those two requirements are met, the dictionary is hot-reloaded without restarting the ES instance.

Below is an endpoint mocked up with Spring Boot:

```java
/**
 * Serve the custom ik dictionary
 * @param response
 */
@RequestMapping(value = "/getCustomDict")
public void getCustomDict(HttpServletResponse response) {
    try {
        // the actual words could come from a database or a cache
        String[] dicWords = new String[]{"雷猴啊", "割韭菜", "996", "实力装X", "颜值担当"};
        // response body: one word per line
        StringJoiner content = new StringJoiner(System.lineSeparator());
        for (String dicWord : dicWords) {
            content.add(dicWord);
        }
        response.setHeader("Last-Modified", String.valueOf(content.length()));
        response.setHeader("ETag", String.valueOf(content.length()));
        response.setContentType("text/plain; charset=utf-8");
        OutputStream out = response.getOutputStream();
        out.write(content.toString().getBytes("UTF-8"));
        out.flush();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
```

7. Prefix, wildcard, regex, and fuzzy queries

```json
# Set up prefix indexing
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {                 // field name
        "type": "text",          // text type
        "index_prefixes": {      // index prefixes of each term
          "min_chars": 2,        // shortest prefix length to index
          "max_chars": 4         // longest prefix length to index
        }
      }
    }
  }
}

# Bulk-insert Chinese test data
POST /my_index/_bulk
{ "index": { "_id": "1"} }
{ "title": "城管打电话喊商贩去摆摊摊" }
{ "index": { "_id": "2"} }
{ "title": "笑果文化回应商贩老农去摆摊" }
{ "index": { "_id": "3"} }
{ "title": "老农耗时17年种出椅子树" }
{ "index": { "_id": "4"} }
{ "title": "夫妻结婚30多年AA制,被城管抓" }
{ "index": { "_id": "5"} }
{ "title": "黑人见义勇为阻止抢劫反被铐住" }

# Bulk-insert English test data
POST /my_index/_bulk
{ "index": { "_id": "6"} }
{ "title": "my english" }
{ "index": { "_id": "7"} }
{ "title": "my english is good" }
{ "index": { "_id": "8"} }
{ "title": "my chinese is good" }
{ "index": { "_id": "9"} }
{ "title": "my japanese is nice" }
{ "index": { "_id": "10"} }
{ "title": "my disk is full" }

# Prefix search
GET my_index/_search
{
  "query": {
    "prefix": {          // prefix query
      "title": {         // field to run the prefix search on
        "value": "城管"   // the prefix value
      }
    }
  }
}

# Test the ik analyzer
GET /_analyze
{
  "text": "城管打电话喊商贩去摆摊摊",
  "analyzer": "ik_smart"
}

# Wildcard
GET my_index/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "eng?ish"
      }
    }
  }
}

# Wildcard
GET order/_search
{
  "query": {
    "wildcard": {
      "insuranceCompany.keyword": {
        "value": "中?联合"
      }
    }
  }
}

# Wildcard
GET product/_search
{
  "query": {
    "wildcard": {
      "name.keyword": {
        "value": "xiaomi*nfc*",
        "boost": 1.0
      }
    }
  }
}

# Regexp
GET product/_search
{
  "query": {
    "regexp": {
      "name": {
        "value": "[\\s\\S]*nfc[\\s\\S]*",
        "flags": "ALL",
        "max_determinized_states": 10000,
        "rewrite": "constant_score"
      }
    }
  }
}

# Fuzzy search
GET order/_search
{
  "query": {
    "match": {
      "tenderAddress": {          // field to search
        "query": "乔呗",           // the (possibly misspelled) value to search for
        "fuzziness": "AUTO",      // fuzziness mode, default "AUTO"
        "operator": "or"          // operator, default "or"
      }
    }
  }
}

# Prefix phrase search: match_phrase_prefix
GET /order/_search
{
  "query": {
    // prefix phrase search
    "match_phrase_prefix": {
      "bidInvationCenter": {          // field to search
        "query": "中国",               // the phrase prefix value
        "analyzer": "ik_smart",       // analyzer to use
        "max_expansions": 1,          // cap on how many terms the last word's prefix may expand to
        "slop": 2,                    // how far apart the phrase's terms may be moved and still match
        "boost": 1                    // weight of this query
      }
    }
  }
}

GET _search
{
  "query": {
    "match_phrase_prefix": {
      "message": {
        "query": "quick brown f"
      }
    }
  }
}
```

8. The TF-IDF algorithm and a first look at the Java API

The official documentation is always the most up-to-date and complete reference: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-overview.html

Add the Maven dependency:

```xml
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.11.1</version>
</dependency>

<repositories>
    <repository>
        <id>es-snapshots</id>
        <name>elasticsearch snapshot repo</name>
        <url>https://snapshots.elastic.co/maven/</url>
    </repository>
    <repository>
        <id>elastic-lucene-snapshots</id>
        <name>Elastic Lucene Snapshots</name>
        <url>https://s3.amazonaws.com/download.elasticsearch.org/lucenesnapshots/83f9835</url>
        <releases><enabled>true</enabled></releases>
        <snapshots><enabled>false</enabled></snapshots>
    </repository>
</repositories>
```

8.1 Basic CRUD

  1. import org.apache.http.HttpHost;
  2. import org.elasticsearch.action.admin.indices.create.CreateIndexRequest;
  3. import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
  4. import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
  5. import org.elasticsearch.action.admin.indices.delete.DeleteIndexResponse;
  6. import org.elasticsearch.action.bulk.BulkRequest;
  7. import org.elasticsearch.action.bulk.BulkResponse;
  8. import org.elasticsearch.action.delete.DeleteRequest;
  9. import org.elasticsearch.action.delete.DeleteResponse;
  10. import org.elasticsearch.action.get.GetRequest;
  11. import org.elasticsearch.action.get.GetResponse;
  12. import org.elasticsearch.action.index.IndexRequest;
  13. import org.elasticsearch.action.index.IndexResponse;
  14. import org.elasticsearch.action.update.UpdateRequest;
  15. import org.elasticsearch.action.update.UpdateResponse;
  16. import org.elasticsearch.client.RequestOptions;
  17. import org.elasticsearch.client.RestClient;
  18. import org.elasticsearch.client.RestHighLevelClient;
  19. import org.elasticsearch.common.settings.Settings;
  20. import org.elasticsearch.common.xcontent.XContentType;
  21. import java.io.IOException;
  22. import java.util.HashMap;
  23. import java.util.Map;
  24. public class TestElasticsearch {
  25. public static void main(String[] args) throws IOException {
  26. RestHighLevelClient client = createClient();//创建链接,创建客户端
  27. try {
  28. // createIndex(client); //创建索引
  29. // addData(client); //添加数据
  30. // queryData(client); //查询数据
  31. // updateData(client); //修改数据
  32. // queryData(client); //查询数据
  33. // deleteData(client); //删除数据
  34. // deleteIndex(client); //删除索引
  35. bulk(client); //批量操作
  36. }catch (Exception e){
  37. e.printStackTrace();
  38. }finally {
  39. if (client != null){
  40. closeClient(client); //关闭客户端
  41. }
  42. }
  43. }
  44. /**
  45. * 创建链接,创建客户端
  46. * @return
  47. */
  48. public static RestHighLevelClient createClient(){
  49. //创建链接,创建客户端
  50. HttpHost http9200 = new HttpHost("8.140.122.156", 9200, "http");
  51. // HttpHost http9201 = new HttpHost("8.140.122.156", 9201, "http");
  52. RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(http9200));
  53. return client;
  54. }
  55. /**
  56. * 关闭客户端
  57. * @param client
  58. * @throws IOException
  59. */
  60. public static void closeClient(RestHighLevelClient client) throws IOException {
  61. client.close();
  62. }
  63. /**
  64. * 创建索引
  65. * PUT order1
  66. * {
  67. * "settings": {
  68. * "number_of_shards": 3,
  69. * "number_of_replicas": 2
  70. * },
  71. * "mappings": {
  72. * "properties": {
  73. * "id":{
  74. * "type": "long"
  75. * },
  76. * "age":{
  77. * "type": "short"
  78. * },
  79. * "mail":{
  80. * "type": "text",
  81. * "fields": {
  82. * "keyword" : {
  83. * "type" : "keyword",
  84. * "ignore_above" : 64
  85. * }
  86. * }
  87. * },
  88. * "name":{
  89. * "type": "text",
  90. * "fields": {
  91. * "keyword" : {
  92. * "type" : "keyword",
  93. * "ignore_above" : 64
  94. * }
  95. * }
  96. * },
  97. * "createTime":{
  98. * "type": "date",
  99. * "format": ["yyyy-MM-dd HH:mm:ss"]
  100. * }
  101. * }
  102. * }
  103. * }
  104. * @param client
  105. * @throws IOException
  106. */
  107. public static void createIndex(RestHighLevelClient client) throws IOException {
  108. //设置3个分片,每个分片2副本
  109. Settings settings = Settings.builder()
  110. .put("number_of_shards",3)
  111. .put("number_of_replicas",2)
  112. .build();
  113. //新增索引
  114. CreateIndexRequest createIndexRequest = new CreateIndexRequest("order1");
  115. createIndexRequest.mapping("properties","{\"id\":{\"type\":\"long\"},\"age\":{\"type\":\"short\"},\"mail\":{\"type\":\"text\",\"fields\":{\"keyword\":{\"type\":\"keyword\",\"ignore_above\":64}}},\"name\":{\"type\":\"text\",\"fields\":{\"keyword\":{\"type\":\"keyword\",\"ignore_above\":64}}},\"createTime\":{\"type\":\"date\",\"format\":[\"yyyy-MM-dd HH:mm:ss\"]}}",XContentType.JSON);
  116. createIndexRequest.settings(settings);
  117. //执行
  118. CreateIndexResponse createIndexResponse = client.indices().create(createIndexRequest, RequestOptions.DEFAULT);
  119. System.out.println("创建索引结果: "+createIndexResponse.isAcknowledged());
  120. }
  121. /**
  122. * 插入数据
  123. *
  124. * @param client
  125. * @throws IOException
  126. */
  127. public static void addData(RestHighLevelClient client) throws IOException {
  128. //数据
  129. Map personMap = new HashMap();
  130. personMap.put("id",1);
  131. personMap.put("age",28);
  132. personMap.put("name","王帆");
  133. personMap.put("email","wangfan@yxinsur.com");
  134. personMap.put("createTime","2020-02-02 12:00:00");
  135. //新增索引
  136. IndexRequest indexRequest = new IndexRequest("order1","_doc","1");
  137. indexRequest.source(personMap);
  138. //执行
  139. IndexResponse indexResponse = client.index(indexRequest, RequestOptions.DEFAULT);
  140. System.out.println("插入数据结果: "+indexResponse.toString());
  141. }
  142. /**
  143. * 查询数据
  144. * GET order1/_doc/1
  145. * @param client
  146. * @throws IOException
  147. */
  148. public static void queryData(RestHighLevelClient client) throws IOException {
  149. //查询索引
  150. GetRequest getRequest = new GetRequest("order1","_doc","1");
  151. //执行
  152. GetResponse getResponse = client.get(getRequest, RequestOptions.DEFAULT);
  153. System.out.println("查询数据结果: "+getResponse.toString());
  154. }
  155. /**
  156. * 修改数据
  157. * PUT order1/_doc/1
  158. * {
  159. * "name": "王帆",
  160. * "age": 29,
  161. * "email": "wangfan@yxinsur.com"
  162. * }
  163. * @param client
  164. * @throws IOException
  165. */
  166. public static void updateData(RestHighLevelClient client) throws IOException {
  167. //实际数据
  168. Map<String,Object> personMap = new HashMap<>();
  169. personMap.put("id",1);
  170. personMap.put("age",29);
  171. personMap.put("name","王帆");
  172. personMap.put("email","wangfan@yxinsur.com");
  173. personMap.put("createTime","2020-02-02 12:00:00");
  174. //修改索引
  175. UpdateRequest updateRequest = new UpdateRequest("order1","_doc","1");
  176. updateRequest.doc(personMap);
  177. //或者是upsert
  178. //updateRequest.upsert(person);
  179. //执行
  180. UpdateResponse updateResponse = client.update(updateRequest, RequestOptions.DEFAULT);
  181. System.out.println("修改数据结果: "+updateResponse.toString());
  182. }
  183. /**
  184. * 删除数据
  185. * DELETE order1/_doc/1
  186. * @param client
  187. * @throws IOException
  188. */
  189. public static void deleteData(RestHighLevelClient client) throws IOException {
  190. //删除数据
  191. DeleteRequest deleteRequest = new DeleteRequest("order1","_doc","1");
  192. //执行
  193. DeleteResponse deleteResponse = client.delete(deleteRequest, RequestOptions.DEFAULT);
  194. System.out.println("删除数据结果: "+deleteResponse.toString());
  195. }
  196. /**
  197. * 删除索引
  198. * DELETE order1/_doc/1
  199. * @param client
  200. * @throws IOException
  201. */
  202. public static void deleteIndex(RestHighLevelClient client) throws IOException {
  203. //删除索引
  204. DeleteIndexRequest deleteIndexRequest = new DeleteIndexRequest("order1");
  205. //执行
  206. DeleteIndexResponse deleteIndexResponse = client.indices().delete(deleteIndexRequest, RequestOptions.DEFAULT);
  207. System.out.println("删除索引结果: "+deleteIndexResponse.isAcknowledged());
  208. }
  209. /**
  210. * 批量操作
  211. * @param client
  212. */
  213. public static void bulk(RestHighLevelClient client) throws IOException {
  214. //准备数据
  215. Map<String,Object> personMap = new HashMap<>();
  216. personMap.put("id",2);
  217. personMap.put("age",29);
  218. personMap.put("name","王帆");
  219. personMap.put("email","wangfan@yxinsur.com");
  220. personMap.put("createTime","2020-02-02 12:00:00");
  221. //插入数据操作
  222. IndexRequest indexRequest2 = new IndexRequest("order1","_doc","2");
  223. indexRequest2.source(personMap,XContentType.JSON);
  224. IndexRequest indexRequest3 = new IndexRequest("order1","_doc","3");
  225. indexRequest3.source(personMap,XContentType.JSON);
  226. //修改数据操作
  227. personMap.put("age",30);
  228. UpdateRequest updateRequest = new UpdateRequest("order1","_doc","2");
  229. updateRequest.doc(personMap,XContentType.JSON);
  230. //删除数据操作
  231. DeleteRequest deleteRequest = new DeleteRequest("order1","_doc","2");
  232. //将操作都放入bulk里面
  233. BulkRequest bulkRequest = new BulkRequest();
  234. bulkRequest.add(indexRequest2);
  235. bulkRequest.add(indexRequest3);
  236. bulkRequest.add(updateRequest);
  237. bulkRequest.add(deleteRequest);
  238. //执行
  239. BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
  240. System.out.println("批量操作结果: "+bulkResponse.toString());
  241. }
  242. }

```
创建索引结果: true
插入数据结果: IndexResponse[index=order1,type=_doc,id=1,version=1,result=created,seqNo=0,primaryTerm=1,shards={"total":3,"successful":1,"failed":0}]
查询数据结果: {"_index":"order1","_type":"_doc","_id":"1","_version":1,"found":true,"_source":{"createTime":"2020-02-02 12:00:00","name":"王帆","id":1,"age":28,"email":"wangfan@yxinsur.com"},"fields":{"_seq_no":[0],"_primary_term":[1]}}
修改数据结果: UpdateResponse[index=order1,type=_doc,id=1,version=2,seqNo=1,primaryTerm=1,result=updated,shards=ShardInfo{total=3, successful=1, failures=[]}]
查询数据结果: {"_index":"order1","_type":"_doc","_id":"1","_version":2,"found":true,"_source":{"createTime":"2020-02-02 12:00:00","name":"王帆","id":1,"age":29,"email":"wangfan@yxinsur.com"},"fields":{"_seq_no":[1],"_primary_term":[1]}}
删除数据结果: DeleteResponse[index=order1,type=_doc,id=1,version=3,result=deleted,shards=ShardInfo{total=3, successful=1, failures=[]}]
删除索引结果: true
批量操作结果: org.elasticsearch.action.bulk.BulkResponse@793be5ca
```

9. ES cluster deployment

Configuration file settings:

  • cluster.name

The name of the whole cluster. Every node uses the same value, and other nodes discover the cluster through cluster.name.

  • node.name

The name of a single node in the cluster; other nodes can discover it through node.name. Defaults to the machine's hostname.

  • path.data

Where the node stores its data. In production this must not live inside the ES installation directory, otherwise upgrading ES will wipe the data.

  • path.logs

Where the node stores its logs. In production this must not live inside the ES installation directory, otherwise upgrading ES will wipe them.

  • bootstrap.memory_lock

Whether to lock the ES memory so it cannot be swapped out (swap uses disk as temporary space when RAM runs short). Swapping must be prevented in production.

  • network.host

The IP address this node binds to. Once set, the node is reachable only through that exact address: if it is set to 127.0.0.1, then only 127.0.0.1 works, not even localhost.

  • http.port

The HTTP service port of this node.

  • transport.port

The transport port of this node, used for node-to-node communication inside the cluster, for example during master election.

  • discovery.seed_hosts

Must contain the address + port of every master-eligible node.

  • cluster.initial_master_nodes

When the cluster bootstraps for the first time, the initial master is elected from the node.name values in this list.

The cluster layout used here:

| Node name | node.master | node.data | Role |
| --- | --- | --- | --- |
| node01 | true | false | master-eligible node |
| node02 | true | false | master-eligible node |
| node03 | true | false | master-eligible node |
| node04 | false | true | data-only node |
| node05 | false | true | data-only node |
| node06 | false | false | coordinating-only (routing) node |

Machine internal IP: 172.27.229.9
Machine external IP: 8.140.122.156

node01 configuration:

```yaml
cluster.name: my-application
node.name: node01
path.data: /root/soft/elasticsearch/data/node01
path.logs: /root/soft/elasticsearch/log/node01
bootstrap.memory_lock: false
network.host: 172.27.229.9
http.port: 9200
transport.port: 9300
discovery.seed_hosts: ["172.27.229.9:9300", "172.27.229.9:9301", "172.27.229.9:9302"]
cluster.initial_master_nodes: ["node01","node02","node03"]
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: true
node.data: false
node.max_local_storage_nodes: 6
```

node02 configuration:

```yaml
cluster.name: my-application
node.name: node02
path.data: /root/soft/elasticsearch/data/node02
path.logs: /root/soft/elasticsearch/log/node02
bootstrap.memory_lock: false
network.host: 172.27.229.9
http.port: 9201
transport.port: 9301
discovery.seed_hosts: ["172.27.229.9:9300", "172.27.229.9:9301", "172.27.229.9:9302"]
cluster.initial_master_nodes: ["node01","node02","node03"]
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: true
node.data: false
node.max_local_storage_nodes: 6
```

node03 configuration:

```yaml
cluster.name: my-application
node.name: node03
path.data: /root/soft/elasticsearch/data/node03
path.logs: /root/soft/elasticsearch/log/node03
bootstrap.memory_lock: false
network.host: 172.27.229.9
http.port: 9202
transport.port: 9302
discovery.seed_hosts: ["172.27.229.9:9300", "172.27.229.9:9301", "172.27.229.9:9302"]
cluster.initial_master_nodes: ["node01","node02","node03"]
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: true
node.data: false
node.max_local_storage_nodes: 6
```

node04 configuration:

```yaml
cluster.name: my-application
node.name: node04
path.data: /root/soft/elasticsearch/data/node04
path.logs: /root/soft/elasticsearch/log/node04
bootstrap.memory_lock: false
network.host: 172.27.229.9
http.port: 9203
transport.port: 9303
discovery.seed_hosts: ["172.27.229.9:9300", "172.27.229.9:9301", "172.27.229.9:9302"]
cluster.initial_master_nodes: ["node01","node02","node03"]
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: false
node.data: true
node.max_local_storage_nodes: 6
```

node05 configuration:

```yaml
cluster.name: my-application
node.name: node05
path.data: /root/soft/elasticsearch/data/node05
path.logs: /root/soft/elasticsearch/log/node05
bootstrap.memory_lock: false
network.host: 172.27.229.9
http.port: 9204
transport.port: 9304
discovery.seed_hosts: ["172.27.229.9:9300", "172.27.229.9:9301", "172.27.229.9:9302"]
cluster.initial_master_nodes: ["node01","node02","node03"]
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: false
node.data: true
node.max_local_storage_nodes: 6
```

node06 configuration:

```yaml
cluster.name: my-application
node.name: node06
path.data: /root/soft/elasticsearch/data/node06
path.logs: /root/soft/elasticsearch/log/node06
bootstrap.memory_lock: false
network.host: 172.27.229.9
http.port: 9205
transport.port: 9305
discovery.seed_hosts: ["172.27.229.9:9300", "172.27.229.9:9301", "172.27.229.9:9302"]
cluster.initial_master_nodes: ["node01","node02","node03"]
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: false
node.data: false
node.max_local_storage_nodes: 6
```