Word Delimiter Token Filter（Word Delimiter 词元过滤器）

Word Delimiter Token Filter（Word Delimiter 词元过滤器）

原文链接 : https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html

译文链接 : http://www.apache.wiki/pages/viewpage.action?pageId=10028522

命名为 word_delimiter，它将单词分解为子词，并对子词组进行可选的转换。 词被分为以下规则的子词：

拆分字内分隔符（默认情况下，所有非字母数字字符）。.
“Wi-Fi” → “Wi”, “Fi”
按大小写转换拆分: “PowerShot” → “Power”, “Shot”
按字母数字转换拆分: “SD500” → “SD”, “500”
每个子词的前导和尾随的词内分隔符都被忽略: “//hello—-there, dude“ → “hello”, “there”, “dude”
每个子词都删除尾随的“s”: “O’Neil’s” → “O”, “Neil” 参数包括： generate_word_parts

true 导致生成单词部分：”PowerShot” ⇒ “Power” “Shot”。默认 true

generate_number_parts

true 导致生成数字子词：”500-42” ⇒ “500” “42”。默认 true

catenate_numbers

true 导致单词最大程度的链接到一起：”wi-fi” ⇒ “wifi”。默认 false

catenate_numbers

**true **导致数字最大程度的连接到一起：”500-42” ⇒ “50042”。默认 false

catenate_all

**true **导致所有的子词可以连接：”wi-fi-4000” ⇒ “wifi4000”。默认 false

split_on_case_change

true 导致 “PowerShot” 作为两个词元（”Power-Shot” 作为两部分看待）。默认 true

preserve_original

true 在子词中保留原始词： “500-42” ⇒ “500-42” “500” “42”。默认 false

split_on_numerics

true 导致 “j2se” 成为三个词元： “j” “2” “se”。默认 true

stem_english_possessive

true 导致每个子词中的 “‘s” 都会被移除：”O’Neil’s” ⇒ “O”, “Neil”。默认 true

高级设置：

protected_words

被分隔时的受保护词列表。一个数组，或者也可以将 protected_words_path 设置为配置有保护字的文件（每行一个）。如果存在，则自动解析为基于 config/ 位置的位置。

type_table

例如，自定义类型映射表（使用 type_table_path 配置时）：
```
    # Map the $, %, '.', and ',' characters to DIGIT
    # This might be useful for financial data.
    $ =&gt; DIGIT
    % =&gt; DIGIT
    . =&gt; DIGIT
    \\u002C =&gt; DIGIT
    # in some cases you might not want to split on ZWJ
    # this also tests the case where we need a bigger byte[]
    # see http://en.wikipedia.org/wiki/Zero-width_joiner
    \\u200D =&gt; ALPHANUM
```
提示：

使用像 standard** tokenizer 的 tokenizer 可能会干扰 catenate_ * 和 preserve_original 参数，因为原始字符串在分词期间可能已经丢失了标点符号。相反，您可能需要使用 whitespacetokenizer**。