These are my notes from teacher Baiqi's ElasticSearch course at Tuling Academy. After finishing the course and working through the operations, I wrote the notes up and published them as a blog post.
Overview
Think about it: on GitHub you can search by code snippet. How is that implemented?
GitHub also uses ES for full-text search of its data. In its ES there is an index recording code contents, with documents roughly like this:
{
  "fileName": "HelloWorld.java",
  "authName": "baiqi",
  "authID": 110,
  "productName": "first-java",
  "path": "/com/baiqi/first",
  "content": "package com.baiqi.first; public class HelloWorld { //code... }"
}
On GitHub we can search by code snippet, and by other conditions too. But what if we need to search by file path?
That is a bit trickier.
For example, when a user types /com, we need to return the paths that start with /com. But some users may have forgotten the prefix and typed only a middle segment of the path; say the user types /baiqi, and we still need to find the document whose path in ES is /com/baiqi/first.
To do this, we need to define a special analyzer for the path field, as follows:
Index
PUT /codes
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fileName": { "type": "keyword" },
      "authName": {
        "type": "text",
        "analyzer": "standard",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "authID": { "type": "long" },
      "productName": {
        "type": "text",
        "analyzer": "standard",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "path": {
        "type": "text",
        "analyzer": "path_analyzer",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "content": { "type": "text", "analyzer": "standard" }
    }
  }
}
Note the custom analyzer definition:

"path_analyzer": {
  "tokenizer": "path_hierarchy"
}
path_hierarchy is a tokenizer that Elasticsearch provides specifically for path-like values: it splits the path at each delimiter and emits every left-anchored prefix as a token.
The path_hierarchy tokenizer's limitation
GET /codes/_analyze
{
  "text": "/a/b/c/d",
  "field": "path"
}
Result
{
  "tokens": [
    { "token": "/a", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "/a/b", "start_offset": 0, "end_offset": 4, "type": "word", "position": 0 },
    { "token": "/a/b/c", "start_offset": 0, "end_offset": 6, "type": "word", "position": 0 },
    { "token": "/a/b/c/d", "start_offset": 0, "end_offset": 8, "type": "word", "position": 0 }
  ]
}
As you can see, searching for /a, /a/b, /a/b/c, or /a/b/c/d will all find this document, because those are exactly the tokens that were indexed. But searching for just /b or /b/c will not find it. That is the limitation.
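Outside a running cluster, the prefix behavior above can be sketched in a few lines of Python. This is a simplified model of what path_hierarchy emits, not the actual Lucene tokenizer:

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Simplified model of ES's path_hierarchy tokenizer:
    emit every left-anchored prefix of the path."""
    parts = path.strip(delimiter).split(delimiter)
    tokens = []
    current = ""
    for part in parts:
        current += delimiter + part
        tokens.append(current)
    return tokens

print(path_hierarchy_tokens("/a/b/c/d"))
# ['/a', '/a/b', '/a/b/c', '/a/b/c/d']
```

Note that "/b" never appears in the output, which is exactly why a search for /b cannot match.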
Insert data and test searches to see the limitation
PUT /codes/_doc/1
{
  "fileName": "HelloWorld.java",
  "authName": "baiqi",
  "authID": 110,
  "productName": "first-java",
  "path": "/com/baiqi/first",
  "content": "package com.baiqi.first; public class HelloWorld { // some code... }"
}
Test searches
Searching for /com and for /com/baiqi both find the document (I won't paste the results here).
GET /codes/_search
{
  "query": {
    "match": {
      "path": "/com"
    }
  }
}
GET /codes/_search
{
  "query": {
    "match": {
      "path": "/com/baiqi"
    }
  }
}
The limitation appears
If I search for /baiqi, nothing comes back:
GET /codes/_search
{
  "query": {
    "match": {
      "path": "/baiqi"
    }
  }
}
This is a real problem: the index clearly contains /com/baiqi, yet just because I forgot the /com prefix I get no results. That is a poor user experience; I want a search for /baiqi to find the document too.
Fixing the limitation
First delete /codes
DELETE /codes
Create the index again; note how the path field is now declared:
"path": {
  "type": "text",
  "analyzer": "path_analyzer",
  "fields": {
    "keyword": {
      "type": "text",       // this means the path field is tokenized a second time, with the standard analyzer
      "analyzer": "standard"
    }
  }
}
PUT /codes
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fileName": { "type": "keyword" },
      "authName": {
        "type": "text",
        "analyzer": "standard",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "authID": { "type": "long" },
      "productName": {
        "type": "text",
        "analyzer": "standard",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "path": {
        "type": "text",
        "analyzer": "path_analyzer",
        "fields": {
          "keyword": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      },
      "content": { "type": "text", "analyzer": "standard" }
    }
  }
}
Insert data
PUT /codes/_doc/1
{
  "fileName": "HelloWorld.java",
  "authName": "baiqi",
  "authID": 110,
  "productName": "first-java",
  "path": "/com/baiqi/first",
  "content": "package com.baiqi.first; public class HelloWorld { // some code... }"
}
Test the query: can /baiqi be found now?
GET /codes/_search
{
  "query": {
    "match": {
      "path.keyword": "/baiqi"
    }
  }
}
Result: this time the document is found.
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "codes",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "fileName": "HelloWorld.java",
          "authName": "baiqi",
          "authID": 110,
          "productName": "first-java",
          "path": "/com/baiqi/first",
          "content": "package com.baiqi.first; public class HelloWorld { // some code... }"
        }
      }
    ]
  }
}
Why does the search only work against path.keyword?
Because when the index was defined, the path field
"path": {
  "type": "text",
  "analyzer": "path_analyzer",
  "fields": {
    "keyword": {
      "type": "text",       // this means the path field is tokenized a second time, with the standard analyzer
      "analyzer": "standard"
    }
  }
}
was configured this way:
path has a sub-field keyword under fields that is analyzed with the standard analyzer, so the same value /com/baiqi/first is additionally indexed as the individual word tokens.
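So the one stored value is indexed two ways at once. The two token streams can be sketched in Python; this is a simplified model, assuming the standard analyzer splits on non-alphanumeric characters and lowercases:

```python
import re

def path_hierarchy_tokens(path, delimiter="/"):
    # Simplified path_hierarchy: every left-anchored prefix of the path.
    parts = path.strip(delimiter).split(delimiter)
    out, cur = [], ""
    for p in parts:
        cur += delimiter + p
        out.append(cur)
    return out

def standard_tokens(text):
    # Rough stand-in for the standard analyzer: lowercase word segments.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

path = "/com/baiqi/first"
print(path_hierarchy_tokens(path))  # tokens behind "path"
print(standard_tokens(path))        # tokens behind "path.keyword"
```

The first field answers prefix-style queries; the second answers segment queries like /baiqi, which is exactly why the multi-field mapping fixes the limitation.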
Test whether other inputs can be found
"path.keyword": "/com/baiqi"
"path.keyword": "/baiqi/first"
"path.keyword": "/com"
"path.keyword": "/com/first"
All of the above find the document.
Why does "path.keyword": "/com/first" also match?
Because path.keyword is a text field analyzed with the standard analyzer: when you enter /com/first, the query itself is tokenized into com and first, and both tokens appear among the indexed tokens of /com/baiqi/first, so the match query (which uses OR by default) finds the document.
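This match behavior can also be sketched in Python, again using a simplified stand-in for the standard analyzer: a match query hits when any analyzed query token appears among the document's tokens (ES's default operator is OR):

```python
import re

def standard_tokens(text):
    # Rough stand-in for the standard analyzer: lowercase word segments.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def match_hits(query, doc_field_value):
    # Simplified match query with the default OR operator:
    # hit if ANY query token appears among the document's tokens.
    doc_tokens = set(standard_tokens(doc_field_value))
    return any(tok in doc_tokens for tok in standard_tokens(query))

doc_path = "/com/baiqi/first"
print(match_hits("/com/first", doc_path))  # True: both 'com' and 'first' are indexed
print(match_hits("/baiqi", doc_path))      # True
print(match_hits("/org", doc_path))        # False: no token overlap
```

If you wanted /com/first to match only when every segment is present, you could set "operator": "and" on the match query instead.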
