These are my notes from teacher Baiqi's ElasticSearch course at Tuling Academy. After finishing the course and working through the operations, I wrote the notes up and published them as a blog post.
Overview
Think about it: on GitHub you can search by code snippet. How is that implemented?
GitHub also uses ES for full-text search of its data. In its ES there is an index recording code contents, with documents roughly like this:
{
  "fileName": "HelloWorld.java",
  "authName": "baiqi",
  "authID": 110,
  "productName": "first-java",
  "path": "/com/baiqi/first",
  "content": "package com.baiqi.first; public class HelloWorld { //code... }"
}
On GitHub we can search by code snippet, and by other conditions too. But what if we need to search by file path?
That is a bit trickier.
For example, when a user types /com, we need to return the paths that start with /com. But some users may have forgotten the prefix and typed only a middle segment of the path; say the user types /baiqi, and we still need to find the document whose path in ES is /com/baiqi/first.
To do this, we need to define a special analyzer for the path field, as follows:
Index
PUT /codes
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fileName": { "type": "keyword" },
      "authName": {
        "type": "text",
        "analyzer": "standard",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "authID": { "type": "long" },
      "productName": {
        "type": "text",
        "analyzer": "standard",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "path": {
        "type": "text",
        "analyzer": "path_analyzer",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "content": { "type": "text", "analyzer": "standard" }
    }
  }
}
Note the custom analyzer definition:

"path_analyzer": {
  "tokenizer": "path_hierarchy"
}
path_hierarchy is a tokenizer that Elasticsearch provides specifically for path-like values: it splits the path at each delimiter and emits every left-anchored prefix as a token.
The path_hierarchy tokenizer's limitation
GET /codes/_analyze
{
  "text": "/a/b/c/d",
  "field": "path"
}
Result
{
  "tokens": [
    { "token": "/a", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "/a/b", "start_offset": 0, "end_offset": 4, "type": "word", "position": 0 },
    { "token": "/a/b/c", "start_offset": 0, "end_offset": 6, "type": "word", "position": 0 },
    { "token": "/a/b/c/d", "start_offset": 0, "end_offset": 8, "type": "word", "position": 0 }
  ]
}
As you can see, searching for /a, /a/b, /a/b/c, or /a/b/c/d will all find this document, because those are exactly the tokens that were indexed. But searching for just /b or /b/c will not find it. That is the limitation.
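Outside a running cluster, the prefix behavior above can be sketched in a few lines of Python. This is a simplified model of what path_hierarchy emits, not the actual Lucene tokenizer:

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Simplified model of ES's path_hierarchy tokenizer:
    emit every left-anchored prefix of the path."""
    parts = path.strip(delimiter).split(delimiter)
    tokens = []
    current = ""
    for part in parts:
        current += delimiter + part
        tokens.append(current)
    return tokens

print(path_hierarchy_tokens("/a/b/c/d"))
# ['/a', '/a/b', '/a/b/c', '/a/b/c/d']
```

Note that "/b" never appears in the output, which is exactly why a search for /b cannot match.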
Insert data and test searches to see the limitation
PUT /codes/_doc/1
{
  "fileName": "HelloWorld.java",
  "authName": "baiqi",
  "authID": 110,
  "productName": "first-java",
  "path": "/com/baiqi/first",
  "content": "package com.baiqi.first; public class HelloWorld { // some code... }"
}
Test searches
Searching for /com and for /com/baiqi both find the document (I won't paste the results here).
GET /codes/_search
{
  "query": {
    "match": {
      "path": "/com"
    }
  }
}
GET /codes/_search
{
  "query": {
    "match": {
      "path": "/com/baiqi"
    }
  }
}
The limitation appears
If I search for /baiqi, nothing comes back:
GET /codes/_search
{
  "query": {
    "match": {
      "path": "/baiqi"
    }
  }
}
This is a real problem: the index clearly contains /com/baiqi, yet just because I forgot the /com prefix I get no results. That is a poor user experience; I want a search for /baiqi to find the document too.
Fixing the limitation
First delete /codes
DELETE /codes
Create the index again; note how the path field is now declared:
"path": {
  "type": "text",
  "analyzer": "path_analyzer",
  "fields": {
    "keyword": {
      "type": "text",       // this means the path field is tokenized a second time, with the standard analyzer
      "analyzer": "standard"
    }
  }
}
PUT /codes
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fileName": { "type": "keyword" },
      "authName": {
        "type": "text",
        "analyzer": "standard",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "authID": { "type": "long" },
      "productName": {
        "type": "text",
        "analyzer": "standard",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "path": {
        "type": "text",
        "analyzer": "path_analyzer",
        "fields": {
          "keyword": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      },
      "content": { "type": "text", "analyzer": "standard" }
    }
  }
}
Insert data
PUT /codes/_doc/1
{
  "fileName": "HelloWorld.java",
  "authName": "baiqi",
  "authID": 110,
  "productName": "first-java",
  "path": "/com/baiqi/first",
  "content": "package com.baiqi.first; public class HelloWorld { // some code... }"
}
Test the query: can /baiqi be found now?
GET /codes/_search
{
  "query": {
    "match": {
      "path.keyword": "/baiqi"
    }
  }
}
Result: this time the document is found.
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "codes",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "fileName": "HelloWorld.java",
          "authName": "baiqi",
          "authID": 110,
          "productName": "first-java",
          "path": "/com/baiqi/first",
          "content": "package com.baiqi.first; public class HelloWorld { // some code... }"
        }
      }
    ]
  }
}
Why does the search only work against path.keyword?
Because when the index was defined, the path field
"path": {
  "type": "text",
  "analyzer": "path_analyzer",
  "fields": {
    "keyword": {
      "type": "text",       // this means the path field is tokenized a second time, with the standard analyzer
      "analyzer": "standard"
    }
  }
}
was configured this way:
path has a sub-field keyword under fields that is analyzed with the standard analyzer, so the same value /com/baiqi/first is additionally indexed as the individual word tokens.
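So the one stored value is indexed two ways at once. The two token streams can be sketched in Python; this is a simplified model, assuming the standard analyzer splits on non-alphanumeric characters and lowercases:

```python
import re

def path_hierarchy_tokens(path, delimiter="/"):
    # Simplified path_hierarchy: every left-anchored prefix of the path.
    parts = path.strip(delimiter).split(delimiter)
    out, cur = [], ""
    for p in parts:
        cur += delimiter + p
        out.append(cur)
    return out

def standard_tokens(text):
    # Rough stand-in for the standard analyzer: lowercase word segments.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

path = "/com/baiqi/first"
print(path_hierarchy_tokens(path))  # tokens behind "path"
print(standard_tokens(path))        # tokens behind "path.keyword"
```

The first field answers prefix-style queries; the second answers segment queries like /baiqi, which is exactly why the multi-field mapping fixes the limitation.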
Test whether other inputs can be found
"path.keyword": "/com/baiqi"
"path.keyword": "/baiqi/first"
"path.keyword": "/com"
"path.keyword": "/com/first"
All of the above find the document.
Why does "path.keyword": "/com/first" also match?
Because path.keyword is a text field analyzed with the standard analyzer: when you enter /com/first, the query itself is tokenized into com and first, and both tokens appear among the indexed tokens of /com/baiqi/first, so the match query (which uses OR by default) finds the document.
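This match behavior can also be sketched in Python, again using a simplified stand-in for the standard analyzer: a match query hits when any analyzed query token appears among the document's tokens (ES's default operator is OR):

```python
import re

def standard_tokens(text):
    # Rough stand-in for the standard analyzer: lowercase word segments.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def match_hits(query, doc_field_value):
    # Simplified match query with the default OR operator:
    # hit if ANY query token appears among the document's tokens.
    doc_tokens = set(standard_tokens(doc_field_value))
    return any(tok in doc_tokens for tok in standard_tokens(query))

doc_path = "/com/baiqi/first"
print(match_hits("/com/first", doc_path))  # True: both 'com' and 'first' are indexed
print(match_hits("/baiqi", doc_path))      # True
print(match_hits("/org", doc_path))        # False: no token overlap
```

If you wanted /com/first to match only when every segment is present, you could set "operator": "and" on the match query instead.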
