Full-Text Search

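MongoDB's built-in full-text search handles Chinese poorly: its text index is built from words, that is, runs of characters separated by delimiters, and Chinese text has no such delimiters between words. Elasticsearch (ES) is therefore used for search instead. Here the Python package mongo-connector syncs data from MongoDB into ES, and queries are then run against ES.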
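To see why delimiter-based tokenization struggles with Chinese, here is a small Python sketch. It only illustrates splitting on whitespace and punctuation and is not MongoDB's actual tokenizer; the function name `delimiter_tokens` is made up for this example.

```python
import re


def delimiter_tokens(text):
    """Split text on runs of whitespace and basic ASCII punctuation."""
    return [tok for tok in re.split(r"[\s,.;:!?]+", text) if tok]


# English words are separated by spaces, so each word becomes a token.
print(delimiter_tokens("full text search"))  # ['full', 'text', 'search']

# A Chinese phrase contains no delimiters, so it stays one opaque token
# and a search for a word inside it cannot match.
print(delimiter_tokens("搜狗输入法"))  # ['搜狗输入法']
```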

Architecture diagram
mongo-es.png

Installation

Installing Elasticsearch

Method 1: download the official prebuilt archive

https://github.com/elastic/elasticsearch
https://www.elastic.co/downloads/elasticsearch

Note: requires Java 8.

    wget -c https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz
    tar xf elasticsearch-7.6.2-linux-x86_64.tar.gz
    cd elasticsearch-7.6.2

Method 2: install from the official yum repository (requires root)

https://www.elastic.co/guide/en/elasticsearch/reference/7.6/rpm.html#rpm-repo

    # 1. Import the GPG key
    rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
    # 2. Add the yum repository
    cat > /etc/yum.repos.d/elasticsearch.repo << EOF
    [elasticsearch]
    name=Elasticsearch repository for 7.x packages
    baseurl=https://artifacts.elastic.co/packages/7.x/yum
    gpgcheck=1
    gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
    enabled=0
    autorefresh=1
    type=rpm-md
    EOF
    # 3. Install from that repository (it is disabled by default, hence --enablerepo)
    yum install --enablerepo=elasticsearch elasticsearch

Method 3: install from the RPM package

    wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-x86_64.rpm
    wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-x86_64.rpm.sha512
    shasum -a 512 -c elasticsearch-7.6.2-x86_64.rpm.sha512
    rpm -ivh elasticsearch-7.6.2-x86_64.rpm
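If shasum is not available, the checksum step can be approximated in Python. This is a minimal sketch that assumes the .sha512 file holds a single line of the form "<hex digest>  <file name>" and that the package sits in the same directory; the helper names `sha512_of` and `verify` are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(checksum_file):
    """Check the file named in a '<digest>  <name>' checksum file."""
    checksum_file = Path(checksum_file)
    expected, name = checksum_file.read_text().strip().split(maxsplit=1)
    # Some tools prefix the name with '*' to mark binary mode.
    target = checksum_file.parent / name.strip().lstrip("*")
    return sha512_of(target) == expected.lower()
```

For example, `verify("elasticsearch-7.6.2-x86_64.rpm.sha512")` should return True for an intact download and False for a corrupted one.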

Installing mongo-connector

    pip install mongo-connector

Installing elastic2-doc-manager

    pip install 'elastic2-doc-manager[elastic5]'

Usage

Enable a replica set in MongoDB (mongo-connector reads the oplog, which only exists on replica set members)
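A one-member replica set is enough for a single-node setup. As a minimal sketch, assuming mongod was started with `--replSet rs0` on port 27015 (the port used in the sync command in this document): the set name `rs0` and both helper names are hypothetical, and `initiate` needs the third-party pymongo package (the `directConnection` option assumes a reasonably recent version).

```python
import json


def replica_set_config(name, hosts):
    """Build the config document accepted by the replSetInitiate command."""
    return {
        "_id": name,
        "members": [{"_id": i, "host": host} for i, host in enumerate(hosts)],
    }


def initiate(config, host="localhost", port=27015):
    """Send replSetInitiate to a mongod that was started with --replSet."""
    # pymongo is imported lazily so the sketch loads without it installed.
    from pymongo import MongoClient

    client = MongoClient(host, port, directConnection=True)
    return client.admin.command("replSetInitiate", config)


# Hypothetical single-member set; only builds and prints the config.
config = replica_set_config("rs0", ["localhost:27015"])
print(json.dumps(config))
```

Calling `initiate(config)` once against the running mongod is the equivalent of running `rs.initiate(...)` in the mongo shell.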

Start ES

    /path/to/elasticsearch-7.6.2/bin/elasticsearch -d    # -d runs it as a daemon
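ES started as a daemon takes a little while before it accepts HTTP requests, so a script that launches the sync right away may fail to connect. The following is a rough sketch of a readiness check that polls the root endpoint; the helper names, the timeout values and the default URL `http://localhost:9200` are assumptions for illustration.

```python
import json
import time
import urllib.request


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def wait_for_es(url="http://localhost:9200", timeout=60.0, interval=1.0,
                fetch=http_get):
    """Poll `url` until it returns JSON, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return json.loads(fetch(url))
        except (OSError, ValueError) as exc:
            # OSError covers refused connections and URL errors,
            # ValueError covers a body that is not valid JSON yet.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{url} not ready after {timeout}s") from exc
            time.sleep(interval)
```

With ES running, `wait_for_es()["version"]["number"]` would give the server version; the `fetch` parameter exists only so the polling logic can be exercised without a live server.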

Data synchronization

    mongo-connector \
      -m localhost:27015 \
      -t localhost:9200 \
      -d elastic2_doc_manager

Using a configuration file

https://github.com/yougov/mongo-connector/wiki/Configuration-Options

    mongo-connector -c config.json

config.json:

    {
      "__comments": "fields starting with __ are ignored",
      "mainAddress": "localhost:27015",
      "docManagers": [
        {
          "docManager": "elastic2_doc_manager",
          "targetURL": "localhost:9200",
          "autoCommitInterval": 0,
          "bulkSize": 5000,
          "args": {
            "clientOptions": {"timeout": 100}
          }
        }
      ]
    }
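When the addresses differ between environments, the same file can be generated rather than edited by hand. A minimal sketch, using only the fields that appear in the config above; the function names `build_config` and `write_config` are hypothetical.

```python
import json


def build_config(mongo_addr, es_url, bulk_size=5000, timeout=100):
    """Build a mongo-connector config dict with one Elasticsearch target."""
    return {
        "mainAddress": mongo_addr,
        "docManagers": [
            {
                "docManager": "elastic2_doc_manager",
                "targetURL": es_url,
                "autoCommitInterval": 0,
                "bulkSize": bulk_size,
                "args": {"clientOptions": {"timeout": timeout}},
            }
        ],
    }


def write_config(path, config):
    """Write the config as JSON for use with `mongo-connector -c <path>`."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2, ensure_ascii=False)


# Print the generated config instead of writing it, to keep this side-effect free.
print(json.dumps(build_config("localhost:27015", "localhost:9200"), indent=2))
```

Calling `write_config("config.json", build_config(...))` produces a file equivalent to the one shown above, minus the `__comments` field.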

Querying

    curl 'localhost:9200/_cat/indices'            # list all indices
    curl 'localhost:9200/pubmed?pretty'           # show the settings and field mappings of the pubmed index
    curl 'localhost:9200/pubmed/_search?pretty'   # search the whole index (no query given)
    curl 'localhost:9200/pubmed/article/5eb64effc3b702070a873076'   # fetch one document by _index/_type/_id
    curl 'localhost:9200/pubmed/article/_search?pretty'             # search within the article type
    # match query sent as a JSON request body
    curl 'localhost:9200/pubmed/_search?pretty' \
      -d '{"query": {"match": {"pmid": 123}}}' \
      -H "Content-Type: application/json"
    # URI query
    curl 'localhost:9200/pubmed/article/_search?q=pmid:1234&pretty'
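The match query above can also be issued from Python with only the standard library. This is a sketch rather than a recommended client: the helper names are invented, and the `pubmed` index and `pmid` field are simply the ones used in the curl examples.

```python
import json
import urllib.request


def match_request(index, field, value, host="http://localhost:9200"):
    """Build a POST /<index>/_search request with a match query body."""
    body = json.dumps({"query": {"match": {field: value}}}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(request):
    """Send the request and decode the JSON response (needs a running ES)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def hits(response):
    """Extract the stored documents from a search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

With ES running, `hits(search(match_request("pubmed", "pmid", 123)))` would return the matching documents as a list of dicts.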

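Plugins

ik

https://github.com/medcl/elasticsearch-analysis-ik/
A Chinese analyzer (word segmenter) for ES.

The default ES analyzer, standard, segments Chinese poorly: it splits the text into single characters.
The ik analyzer has two modes:

  • ik_smart: coarse-grained segmentation
  • ik_max_word: fine-grained segmentation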

Test:

    curl 'localhost:9200/_analyze?pretty' \
      -H "Content-Type: application/json" \
      -d '{"analyzer": "ik_smart", "text": "搜狗输入法"}'
    # tokens: ['搜狗', '输入法']

    curl 'localhost:9200/_analyze?pretty' \
      -H "Content-Type: application/json" \
      -d '{"analyzer": "ik_max_word", "text": "搜狗输入法"}'
    # tokens: ['搜狗', '输入法', '输入', '法']
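To compare analyzers side by side, the two _analyze calls can be wrapped in a few lines of Python. A minimal sketch, assuming ES with the ik plugin is reachable at `http://localhost:9200`; the helper names are hypothetical, and `token_texts` relies on the `tokens` list that the _analyze API returns.

```python
import json
import urllib.request


def analyze_request(analyzer, text, host="http://localhost:9200"):
    """Build a POST /_analyze request for the given analyzer and text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Reduce an _analyze response to the list of token strings."""
    return [tok["token"] for tok in response["tokens"]]


def compare(text, analyzers=("standard", "ik_smart", "ik_max_word")):
    """Return {analyzer: tokens} for each analyzer (needs a running ES)."""
    result = {}
    for name in analyzers:
        with urllib.request.urlopen(analyze_request(name, text)) as resp:
            result[name] = token_texts(json.load(resp))
    return result
```

Calling `compare("搜狗输入法")` against a live instance makes the difference between the single-character output of standard and the two ik modes easy to see.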

kibana

https://www.elastic.co/cn/downloads/kibana
The visualization tool for ES.

    ./bin/kibana    # default config file: config/kibana.yml

logstash

https://www.elastic.co/cn/downloads/logstash
A log collection and processing tool.