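These notes come from instructor Baiqi's ElasticSearch course at Tuling Academy (图灵学院). After finishing the course and working through the exercises myself, I wrote up these notes and published them as a blog post.
Overview
Consider this: GitHub lets you search by code snippet. How is that implemented?
GitHub also uses ES for full-text search. One of its ES indices stores file contents, with documents that look roughly like this: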
{
"fileName" : "HelloWorld.java",
"authName" : "baiqi",
"authID" : 110,
"productName" : "first-java",
"path" : "/com/baiqi/first",
"content" : "package com.baiqi.first; public class HelloWorld { //code... }"
}
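We can search GitHub by code snippet, and by other criteria as well. But what if we need to search by file path?
That is a bit more involved.
For example, when a user enters /com, we need to return the documents whose path starts with /com. Some users, however, forget the prefix and enter only a middle segment of the path. If a user enters /baiqi, we still need to find the documents in ES whose path is /com/baiqi/first.
To support this, we define a special analyzer for the path field, as follows:
Index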
PUT /codes
{
"settings": {
"analysis": {
"analyzer": {
"path_analyzer": {
"tokenizer": "path_hierarchy"
}
}
}
},
"mappings": {
"properties": {
"fileName": {
"type": "keyword"
},
"authName": {
"type": "text",
"analyzer": "standard",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"authID": {
"type": "long"
},
"productName": {
"type": "text",
"analyzer": "standard",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"path": {
"type": "text",
"analyzer": "path_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"content": {
"type": "text",
"analyzer": "standard"
}
}
}
}
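Note:
"path_analyzer": {
"tokenizer": "path_hierarchy"
}
path_hierarchy is a tokenizer that ElasticSearch provides specifically for path-like values. Its defining behavior is that it splits the value on the path separator and emits every prefix, starting from the root, as a token.
The limitation of the path_hierarchy tokenizer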
GET /codes/_analyze
{
"text": "/a/b/c/d",
"field": "path"
}
Result
{
"tokens" : [
{
"token" : "/a",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "/a/b",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "/a/b/c",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "/a/b/c/d",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 0
}
]
}
So a search for /a, /a/b, /a/b/c, or /a/b/c/d will find this document, but a search for only /b or /b/c will not. That is the limitation.
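To make the token list above easier to reason about, here is a minimal Python sketch that approximates what path_hierarchy does with its default settings. The function name path_hierarchy_tokens is my own and is not part of ElasticSearch; the real tokenizer has more options (custom delimiter, reverse, skip) that this sketch ignores.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Approximate path_hierarchy: emit every root-anchored prefix of the path."""
    # Positions of every delimiter except a leading one mark where a prefix ends.
    cuts = [i for i, ch in enumerate(path) if ch == delimiter and i > 0]
    # Each prefix up to a cut is a token, and the full path is the last token.
    return [path[:i] for i in cuts] + [path]


tokens = path_hierarchy_tokens("/a/b/c/d")
print(tokens)          # ['/a', '/a/b', '/a/b/c', '/a/b/c/d']
print("/b" in tokens)  # False: no token starts in the middle of the path
```

Every token starts at the root, which is why a term such as /b or /b/c never appears in the index.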
Insert data and test the search to see the limitation
PUT /codes/_doc/1
{
"fileName": "HelloWorld.java",
"authName": "baiqi",
"authID": 110,
"productName": "first-java",
"path": "/com/baiqi/first",
"content": "package com.baiqi.first; public class HelloWorld { // some code... }"
}
Test the search
Searching for /com and for /com/baiqi both return the document; I won't paste the results here.
GET /codes/_search
{
"query": {
"match": {
"path": "/com"
}
}
}
GET /codes/_search
{
"query": {
"match": {
"path": "/com/baiqi"
}
}
}
The limitation shows up
If I search for /baiqi, nothing is found.
GET /codes/_search
{
"query": {
"match": {
"path": "/baiqi"
}
}
}
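This is a real problem. The index clearly contains /com/baiqi, yet nothing comes back just because I forgot the /com prefix. That is a poor user experience; I want a search for /baiqi to find the document too.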
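The three searches above can be replayed with a small simulation. This is only a sketch under simplifying assumptions: it treats a match query as "analyze the query text with the field's analyzer, then hit if any query term is among the indexed terms", and it ignores scoring. The helper names path_hierarchy_tokens and match_path are hypothetical, not ElasticSearch APIs.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Approximate path_hierarchy: emit every root-anchored prefix of the path."""
    cuts = [i for i, ch in enumerate(path) if ch == delimiter and i > 0]
    return [path[:i] for i in cuts] + [path]


def match_path(indexed_path, query):
    """Simulate a match query on a field analyzed with path_hierarchy."""
    indexed_terms = set(path_hierarchy_tokens(indexed_path))
    # The query text goes through the same analyzer as the indexed value.
    return any(term in indexed_terms for term in path_hierarchy_tokens(query))


doc_path = "/com/baiqi/first"
for query in ["/com", "/com/baiqi", "/baiqi"]:
    print(query, match_path(doc_path, query))
# /com True
# /com/baiqi True
# /baiqi False
```

The query /baiqi produces the single term /baiqi, while the document only contains /com, /com/baiqi, and /com/baiqi/first, so there is nothing to match.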
Fixing the limitation
First, delete /codes
DELETE /codes
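Create the index again, paying attention to how the path field is declared
"path": {
"type": "text",
"analyzer": "path_analyzer",
"fields": {
"keyword": {
"type": "text", // the path value is analyzed a second time, with the standard analyzer, into the path.keyword sub-field
"analyzer": "standard"
}
}
}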
PUT /codes
{
"settings": {
"analysis": {
"analyzer": {
"path_analyzer": {
"tokenizer": "path_hierarchy"
}
}
}
},
"mappings": {
"properties": {
"fileName": {
"type": "keyword"
},
"authName": {
"type": "text",
"analyzer": "standard",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"authID": {
"type": "long"
},
"productName": {
"type": "text",
"analyzer": "standard",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"path": {
"type": "text",
"analyzer": "path_analyzer",
"fields": {
"keyword": {
"type": "text",
"analyzer": "standard"
}
}
},
"content": {
"type": "text",
"analyzer": "standard"
}
}
}
}
Insert data
PUT /codes/_doc/1
{
"fileName": "HelloWorld.java",
"authName": "baiqi",
"authID": 110,
"productName": "first-java",
"path": "/com/baiqi/first",
"content": "package com.baiqi.first; public class HelloWorld { // some code... }"
}
Test the query: can the document be found by entering /baiqi?
GET /codes/_search
{
"query": {
"match": {
"path.keyword": "/baiqi"
}
}
}
Result:
The document is now found.
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "codes",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"fileName" : "HelloWorld.java",
"authName" : "baiqi",
"authID" : 110,
"productName" : "first-java",
"path" : "/com/baiqi/first",
"content" : "package com.baiqi.first; public class HelloWorld { // some code... }"
}
}
]
}
}
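Why does the search only work against path.keyword?
Because path was declared like this when the index was defined:
"path": {
"type": "text",
"analyzer": "path_analyzer",
"fields": {
"keyword": {
"type": "text", // the path value is analyzed a second time, with the standard analyzer, into the path.keyword sub-field
"analyzer": "standard"
}
}
}
The fields of path contain a sub-field named keyword that is analyzed with the standard analyzer. Despite its name, path.keyword is a text field, so /com/baiqi/first is indexed there as the separate terms com, baiqi, and first, and any one of them can be matched.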
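Testing whether other inputs are found
"path.keyword": "/com/baiqi"
"path.keyword": "/baiqi/first"
"path.keyword": "/com"
"path.keyword": "/com/first"
All of these find the document.
Why does "path.keyword": "/com/first" also match?
Because path.keyword is a text field, the input /com/first is itself analyzed. The standard analyzer drops the slashes and produces the terms com and first, and a match query by default needs only one of its terms to appear in the document.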
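The behaviour of path.keyword can be sketched in Python as well. This is a rough approximation, not the real analyzer: standard_tokens uses a simple regex that is good enough for plain ASCII paths like the ones in this post, and match_standard models only the "or" and "and" operators of a match query. Both function names are my own.

```python
import re


def standard_tokens(text):
    """Rough stand-in for the standard analyzer on simple ASCII paths."""
    # Lowercase, then keep runs of word characters; slashes are dropped.
    return re.findall(r"\w+", text.lower())


def match_standard(indexed_text, query, operator="or"):
    """Simulate a match query on a text field analyzed with standard."""
    indexed_terms = set(standard_tokens(indexed_text))
    hits = [term in indexed_terms for term in standard_tokens(query)]
    if operator == "and":
        return bool(hits) and all(hits)  # every query term must be present
    return any(hits)                     # default: one matching term is enough


doc_path = "/com/baiqi/first"
print(standard_tokens(doc_path))                      # ['com', 'baiqi', 'first']
print(match_standard(doc_path, "/baiqi"))             # True
print(match_standard(doc_path, "/com/first"))         # True
print(match_standard(doc_path, "/com/other"))         # True, because com matches
print(match_standard(doc_path, "/com/other", "and"))  # False
```

One consequence worth knowing: with the default operator, a path that shares only one segment with the document also matches. If that is too loose, the match query accepts "operator": "and", which requires every term of the input to be present.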