Pattern Replace Character Filter
原文链接 : https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-pattern-replace-charfilter.html
译文链接 :http://www.apache.wiki/display/Elasticsearch/Pattern+Replace+Character+Filter
Pattern Replace Character Filter 使用正则表达式来匹配应替换为指定替换字符串的字符。替换字符串可以引用正则表达式中的捕获组。
小心”病态”正则表达式
Pattern Replace Character Filter使用Java正则表达式。
一个“病态”的正则表达式可能会运行得非常慢,甚至会抛出一个StackOverflowError,还会导致其运行的节点突然退出。
更多信息请查看 pathological regular expressions and how to avoid them.
配置
Pattern Replace Character Filter 会接收以下参数:
| 参数名称 | 参数说明 |
|---|---|
pattern |
一个Java正则表达式. 必填 |
replacement |
替换字符串,可以使用$ 1 .. $ 9语法来引用捕获组,可以参考这里。 |
flags |
Java正则表达式标志。 标志应该用“|”进行分割,例如“CASE_INSENSITIVE | COMMENTS”。 |
在下面例子中,我们使用Pattern Replace Character Filter实现用下划线替代任何嵌入的破折号,即123-456-789→123_456_789:
案例
PUT my_index{"settings": {"analysis": {"analyzer": {"my_analyzer": {"tokenizer": "standard","char_filter": ["my_char_filter"]}},"char_filter": {"my_char_filter": {"type": "pattern_replace","pattern": "(\\d+)-(?=\\d)","replacement": "$1_"}}}}}POST my_index/_analyze{"analyzer": "my_analyzer","text": "My credit card is 123-456-789"}
上面案例将返回如下结果:
[ My, credit, card, is 123_456_789 ]
出于搜索目的使用替换字符串,会引起原始文本长度的更改,进而导致不正确的高亮显示。案例如下所示。
这个案例在遇到小写字母后跟大写字母(即fooBarBaz→foo Bar Baz)时插入空格,允许单独查询camelCase字词:
案例
PUT my_index{"settings": {"analysis": {"analyzer": {"my_analyzer": {"tokenizer": "standard","char_filter": ["my_char_filter"],"filter": ["lowercase"]}},"char_filter": {"my_char_filter": {"type": "pattern_replace","pattern": "(?<=\\p{Lower})(?=\\p{Upper})","replacement": " "}}}},"mappings": {"my_type": {"properties": {"text": {"type": "text","analyzer": "my_analyzer"}}}}}POST my_index/_analyze{"analyzer": "my_analyzer","text": "The fooBarBaz method"}
案例返回的结果如下:
[ the, foo, bar, baz, method ]
查询 bar 可以正确找到文档,但突出显示结果将产生不正确的高光,因为我们的字符过滤器更改了原始文本的长度:
PUT my_index/my_doc/1?refresh{"text": "The fooBarBaz method"}GET my_index/_search{"query": {"match": {"text": "bar"}},"highlight": {"fields": {"text": {}}}}
以上的输出结果是:
{"timed_out": false,"took": $body.took,"_shards": {"total": 5,"successful": 5,"failed": 0},"hits": {"total": 1,"max_score": 0.2824934,"hits": [{"_index": "my_index","_type": "my_doc","_id": "1","_score": 0.2824934,"_source": {"text": "The fooBarBaz method"},"highlight": {"text": ["The foo<em>Ba</em>rBaz method"]}}]}}
