Grok Processor（Grok 处理器）

Grok Processor（Grok 处理器）

原文链接 : https://www.elastic.co/guide/en/elasticsearch/reference/5.3/grok-processor.html

译文链接 : http://www.apache.wiki/pages/viewpage.action?pageId=10027802

从 document（文档）中的单个 text filed（文本字段）提取 structured fields（结构化字段）。您可以选择从哪个字段来提取所匹配的字段，以及您想要匹配的 grok **pattern。grok pattern 就像正则表达式，并且支持可以重用的 aliased expressions**（别名表达式）。

此工具非常适用于 syslog logs，apache 和其它的 webserver logs，mysql logs，以及一般情况下，用于人类而不是计算机使用的任何的 log format（日志格式）。该 processor（处理器）包含超过 120 种可重用的 patterns。

如果您需要工具来帮助 building patterns（构建模式）以匹配 log（日志），您将会发现 http://grokdebug.herokuapp.com 和 http://grokconstructor.appspot.com/ 应用程序是相当有用的。

Grok Basics（Grok 基础）

Grok 以 regular expressions（正则表达式）为基础，所以在 Grok 中的任何正则表达式也是有效的。正则表达式库是 Oniguruma。您可以在 Onigiruma 网站上查看所支持的完整的 r**egexp syntax**（正在表达式语法）。

Grok 通过利用这种正则表达式语言来工作，允许命名现有的 pattern（模式），并将它们组合成与您的字段相匹配的更复杂的 pattern（模式）。

对于重用 grok pattern（grok 模式）的语法有三种形式 : %{SYNTAX:SEMANTIC}，%{SYNTAX}，**%{SYNTAX:SEMANTIC:TYPE}**。

该 SYNTAX（语法）是将要匹配您的文本的 pattern（模式）的名称。例如，3.44 将会被 NUMBER 模式匹配并且 55.3.244.1 将会被 IP 模式匹配。该语法是告诉你如何匹配的。**NUMBER**和 **IP**都是在 default patterns set（默认模式集）中提供的 pattern（模式）。

该 SEMANTIC（语义）是您给一段被匹配的文本的标识符。例如，3.44 可以是事件的持续时间，所以你可以简单的称之为 duration。此外，字符串 55.3.244.1 可能会标识 client 发出的请求。

该 TYPE（类型）是您希望转换您命名的 field（字段）的 type（类型）。int 和 float 是目前唯一所支持的强制类型。

例如，您可能想要去匹配以下文本 :

3.44 55.3.244.1

您可能知道该示例中的消息是一个 number（数字），后跟一个 IP address（IP 地址）。您可以通过使用下列的 Grok **expression（Grok** 表达式）来匹配这个文本。

%{NUMBER:duration} %{IP:client}

Using the Grok Processor in a Pipeline（在管道中使用 Grok 表达式）

Table 20. Grok Options（表 20. Grok 选项）

Name（名称）	Required（必要的）	Default（默认值）	Description（描述）
`field`	yes	-	The field to use for grok expression parsing
`patterns`	yes	-	An ordered list of grok expression to match and extract named captures with. Returns on the first expression in the list that matches.
`pattern_definitions`	no	-	A map of pattern-name and pattern tuples defining custom patterns to be used by the current processor. Patterns matching existing names will override the pre-existing definition.
`trace_match`	no	false	when true, `_ingest._grok_match_index` will be inserted into your matched document’s metadata with the index into the pattern found in `patterns` that matched.
`ignore_missing`	no	false	If `true` and `field` does not exist or is `null`, the processor quietly exits without modifying the document

以下是使用提供的 pattern（模式）从 document（文档）中的 string field（字符串字段）中提取和命名结构化字段的示例。

{
  "message": "55.3.244.1 GET /index.html 15824 0.043"
}

这个 pattern（模式）可以是 :

%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

以下是一个使用 Grok 处理上述 document（文档）的示例 pipeline（管道）:

{
  "description" : "...",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"]
      }
    }
  ]
}

此 pipeline（管道）将这些 named captures（命名捕获）作为文档中的新字段插入，如下所示 :

{
  "message": "55.3.244.1 GET /index.html 15824 0.043",
  "client": "55.3.244.1",
  "method": "GET",
  "request": "/index.html",
  "bytes": 15824,
  "duration": "0.043"
}

Custom Patterns and Pattern Files（自定义模式和模式文件）

该 Grok processor 采用基本的 pattern（模式）进行预包装。这些 pattern（模式）可能并不总是有你想要的。Pattern 有一个非常基本的格式。每个 entry 描述有一个 name（名称）和 pattern（模式）本身。

您也可以在 pattern_definitions 选项下添加您自己的 pattern（模式）到 processor definition（处理器定义）中。

以下是一个指定自定义 pattern definitions（模式定义）的 pipeline（管道）:

{
  "description" : "...",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["my %{FAVORITE_DOG:dog} is colored %{RGB:color}"]
        "pattern_definitions" : {
          "FAVORITE_DOG" : "beagle",
          "RGB" : "RED|GREEN|BLUE"
        }
      }
    }
  ]
}

Providing Multiple Match Patterns（提供多个匹配模式）

有时一种 pattern（模式）不足以捕捉一个 field（字段）的潜在结构。假设我们要匹配包含您最喜欢的猫或狗宠物品种的所有 message（消息）。实现这一点的一个方法是提供两个不同的 pattern（模式），而不是一个真正复杂的表达式所捕获相同的 **or**行为。

以下是针对 simulate API（模拟 API）执行的这种配置的示例 :

curl -XPOST 'localhost:9200/_ingest/pipeline/_simulate?pretty' -H 'Content-Type: application/json' -d'
{
  "pipeline": {
  "description" : "parse multiple patterns",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{FAVORITE_DOG:pet}", "%{FAVORITE_CAT:pet}"],
        "pattern_definitions" : {
          "FAVORITE_DOG" : "beagle",
          "FAVORITE_CAT" : "burmese"
        }
      }
    }
  ]
},
"docs":[
  {
    "_source": {
      "message": "I love burmese cats!"
    }
  }
  ]
}
'

响应如下 :

{
  "docs": [
    {
      "doc": {
        "_type": "_type",
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "message": "I love burmese cats!",
          "pet": "burmese"
        },
        "_ingest": {
          "timestamp": "2016-11-08T19:43:03.850+0000"
        }
      }
    }
  ]
}

两种 pattern（模式）都将使用适当的匹配来设置该字段 pet。但是如果要跟踪是哪以个模式匹配并且填充了字段，该怎么办呢？我们可以通过使用 trace_match参数来做到这一点。

以下是一个一样 pipeline（管道）的输出，但是使用的是 “trace_match“: true 的配置 :

{
  "docs": [
    {
      "doc": {
        "_type": "_type",
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "message": "I love burmese cats!",
          "pet": "burmese"
        },
        "_ingest": {
          "_grok_match_index": "1",
          "timestamp": "2016-11-08T19:43:03.850+0000"
        }
      }
    }
  ]
}

在上述响应中，您可以看到匹配的 pattern（模式）的 index（索引）为 "**1**"。这就是说，它是在 patterns 中用于匹配的第二个（索引从零开始）模式。

这些所跟踪的元数据可以调试哪些 patterns（模式）被匹配到了。这些信息存储在 ingest metadata（元数据）中，并且不会被索引。