Compound Word Token Filter

Original: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/analysis-compound-word-tokenfilter.html

Translation: http://www.apache.wiki/pages/viewpage.action?pageId=10027907

Contributors: 李亚运, ApacheCN, Apache中文网

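Overview

The hyphenation_decompounder and dictionary_decompounder token filters split compound words, which are common in German and other Germanic languages, into their component subwords.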

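Both filters require a word dictionary, which can be provided in one of two ways:

word_list: an array of words specified inline in the token filter configuration.

word_list_path: the path (absolute, or relative to the config directory) to a file containing the words.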

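Hyphenation decompounder

The hyphenation_decompounder uses a hyphenation grammar to find candidate subwords, then checks each candidate against the word dictionary. The quality of the output tokens depends directly on the quality of the grammar file you use. For languages such as German, the available grammars work well.

XML-based hyphenation grammar files can be found in the Objects For Formatting Objects (OFFO) SourceForge project. Currently only hyphenation files compatible with FOP v1.2 are supported. You can download offo-hyphenation_v1.2.zip directly and look in the offo-hyphenation/hyph/ directory. For more background, see the Apache FOP project.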
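The lookup described above can be sketched as follows. This is a minimal illustration, not the filter's actual implementation: the function name `hyphenation_decompound` and its `break_points` argument are hypothetical, and the break points are assumed to have been computed beforehand from a hyphenation grammar (loading FOP pattern files is out of scope here). Matching is exact, so the sketch assumes the token and the dictionary are already lowercased.

```python
def hyphenation_decompound(token, break_points, word_list,
                           min_subword_size=2, max_subword_size=15):
    """Return the original token followed by dictionary subwords that
    start and end on hyphenation break points.

    break_points is a hypothetical input: character offsets inside the
    token where a hyphenation grammar would allow a break.
    """
    dictionary = set(word_list)
    # Always treat the start and end of the token as break points.
    points = sorted(set(break_points) | {0, len(token)})
    subwords = []
    for i, start in enumerate(points):
        for end in points[i + 1:]:
            size = end - start
            if size > max_subword_size:
                break  # later end points only make the candidate longer
            if size < min_subword_size:
                continue
            candidate = token[start:end]
            if candidate in dictionary:
                subwords.append(candidate)
    return [token] + subwords


if __name__ == "__main__":
    # Assumed break points for "do-nau-dampf-schiff".
    print(hyphenation_decompound("donaudampfschiff", [2, 5, 10],
                                 ["donau", "dampf", "schiff"]))
    # ['donaudampfschiff', 'donau', 'dampf', 'schiff']
```

Because only substrings that begin and end on a break point are tested, far fewer candidates are looked up than in an exhaustive scan, which is why grammar quality matters so much for the result.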

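Dictionary decompounder

The dictionary_decompounder uses a brute-force approach: relying on the word dictionary alone, it searches a compound word for subwords. It is much slower than the hyphenation decompounder, but it can serve as a first step for checking the quality of your dictionary.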
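A brute-force scan of this kind can be sketched as below. This is an illustration under stated assumptions, not the filter's source code: the function name `dictionary_decompound` is hypothetical, its keyword arguments are named after the filter's size and longest-match options, and matching is exact, so the token and the dictionary are assumed to be lowercased already.

```python
def dictionary_decompound(token, word_list, min_word_size=5,
                          min_subword_size=2, max_subword_size=15,
                          only_longest_match=False):
    """Return the original token followed by every dictionary word
    found as a substring of it (brute-force scan)."""
    dictionary = set(word_list)
    out = [token]  # the original token is always kept
    if len(token) < min_word_size:
        return out  # too short to be treated as a compound
    for start in range(len(token) - min_subword_size + 1):
        longest = None
        for size in range(min_subword_size, max_subword_size + 1):
            end = start + size
            if end > len(token):
                break
            candidate = token[start:end]
            if candidate in dictionary:
                if only_longest_match:
                    # sizes grow, so the last hit at this start is the longest
                    longest = candidate
                else:
                    out.append(candidate)
        if longest is not None:
            out.append(longest)
    return out


if __name__ == "__main__":
    words = ["donau", "dampf", "schiff", "dampfschiff"]
    print(dictionary_decompound("donaudampfschiff", words))
    # ['donaudampfschiff', 'donau', 'dampf', 'dampfschiff', 'schiff']
    print(dictionary_decompound("donaudampfschiff", words,
                                only_longest_match=True))
    # ['donaudampfschiff', 'donau', 'dampfschiff', 'schiff']
```

Every start offset is combined with every allowed length, so the number of dictionary lookups grows with the token length times the subword size range. That cost is the reason this approach is slower than one guided by hyphenation points.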

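Compound token filter parameters

The following parameters can be used to configure a compound word token filter:

type: either dictionary_decompounder or hyphenation_decompounder.

word_list: an array of words to use as the word dictionary.

word_list_path: the path (absolute, or relative to the config directory) to the word dictionary file.

hyphenation_patterns_path: the path (absolute, or relative to the config directory) to a FOP XML hyphenation pattern file. Required for hyphenation_decompounder.

min_word_size: minimum word size. Defaults to 5.

min_subword_size: minimum subword size. Defaults to 2.

max_subword_size: maximum subword size. Defaults to 15.

only_longest_match: whether to keep only the longest matching subword. Defaults to false.

For example: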

index:
    analysis:
        analyzer:
            myAnalyzer2:
                type: custom
                tokenizer: standard
                # filters run in the order listed
                filter: [myTokenFilter1, myTokenFilter2]
        filter:
            myTokenFilter1:
                type: dictionary_decompounder
                # inline word dictionary
                word_list: [one, two, three]
            myTokenFilter2:
                type: hyphenation_decompounder
                # word dictionary file
                word_list_path: path/to/words.txt
                # FOP-compatible XML hyphenation patterns
                hyphenation_patterns_path: path/to/fop.xml
                max_subword_size: 22
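The same analysis settings can also be written as a JSON request body, for instance to send when creating an index. The sketch below only builds and prints that body; it makes no network call. The variable names `settings_body` and `body_json` are illustrative, and the file paths are the placeholders from the example above.

```python
import json

# Mirror of the YAML example: one custom analyzer chaining both filters.
settings_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "myAnalyzer2": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["myTokenFilter1", "myTokenFilter2"],
                }
            },
            "filter": {
                "myTokenFilter1": {
                    "type": "dictionary_decompounder",
                    "word_list": ["one", "two", "three"],
                },
                "myTokenFilter2": {
                    "type": "hyphenation_decompounder",
                    "word_list_path": "path/to/words.txt",
                    "hyphenation_patterns_path": "path/to/fop.xml",
                    "max_subword_size": 22,
                },
            },
        }
    }
}

# Serialize to JSON text suitable for use as a request body.
body_json = json.dumps(settings_body, indent=2)
print(body_json)
```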