Relationship type - join - Elasticsearch

join type

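The join type is a special field that creates parent/child relations between documents in the same index. The relations section defines the set of possible relations within the documents; each relation consists of a parent name and a child name. A parent/child relation can be defined as follows: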

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_join_field": { 
          "type": "join",
          "relations": {
            "question": "answer" 
          }
        }
      }
    }
  }
}

my_join_field is the name of the field.

Defines a single relation in which question is the parent of answer.

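To index a document with a join, the name of the relation and, optionally, the parent of the document must be provided in the source. The following example creates two parent documents in the question context: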

PUT my_index/_doc/1?refresh
{
  "text": "This is a question",
  "my_join_field": {
    "name": "question" 
  }
}
PUT my_index/_doc/2?refresh
{
  "text": "This is another question",
  "my_join_field": {
    "name": "question"
  }
}

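This document is a question document.

When indexing a parent document, you can specify just the name of the relation as a shortcut instead of wrapping it in an object: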

PUT my_index/_doc/1?refresh
{
  "text": "This is a question",
  "my_join_field": "question" 
}
PUT my_index/_doc/2?refresh
{
  "text": "This is another question",
  "my_join_field": "question"
}

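The simpler notation for a parent document uses just the relation name.

When indexing a child document, the name of the relation and the id of the parent document must both be added to the _source.

All documents of one lineage must be indexed on the same shard, so a child document must always be routed using the id of its parent document.

The following examples show how to index two child documents: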

PUT my_index/_doc/3?routing=1&refresh (1)
{
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer", (2)
    "parent": "1" (3)
  }
}
PUT my_index/_doc/4?routing=1&refresh
{
  "text": "This is another answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}

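(1) The routing value is mandatory because parent and child documents must be indexed on the same shard.

(2) answer is the name of the join for this document.

(3) The id of the parent document of this child document.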
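
The routing rule above can be sketched on the client side as follows. This is a minimal illustration only: the helper name child_index_request and its parameters are hypothetical, and it merely assembles the request path and body shown in the example instead of calling any Elasticsearch client.

```python
import json
from urllib.parse import urlencode


def child_index_request(index, doc_id, relation, parent_id, source,
                        join_field="my_join_field"):
    """Build (method, path, body) for indexing a child document.

    Hypothetical helper: the routing parameter is set to the parent id so
    that the child is routed to the same shard as its parent.
    """
    body = dict(source)  # copy so the caller's dict is not modified
    body[join_field] = {"name": relation, "parent": parent_id}
    path = f"/{index}/_doc/{doc_id}?{urlencode({'routing': parent_id})}"
    return "PUT", path, json.dumps(body)


method, path, body = child_index_request(
    "my_index", "3", "answer", "1", {"text": "This is an answer"}
)
print(method, path)
print(body)
```

Deriving the routing value from the parent id in one place makes it harder to index a child with a mismatched routing value by accident.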

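Parent-join and performance

The join field should not be used like a join in a relational database. In Elasticsearch, the key to good performance is to denormalize your data into documents. Each join field, has_child query, or has_parent query adds a significant cost to query performance.

The only case where the join field makes sense is when your data contains a one-to-many relationship in which one entity significantly outnumbers the other. An example is products and the offers for those products. Because offers significantly outnumber products, it makes sense to model the product as the parent document and the offer as the child document.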

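Parent-join restrictions

Only one join field mapping is allowed per index.
Parent and child documents must be indexed on the same shard. This means the same routing value must be provided when getting, deleting, or updating a child document.
An element can have multiple children but only one parent.
A new relation can be added to an existing join field.
A child can be added to an existing element, but only if that element is already a parent.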
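
The "multiple children but only one parent" rule can be checked on a relations definition before it is sent to the cluster. The sketch below is a client-side illustration under that assumption; child_to_parent is a hypothetical helper and does not reflect how Elasticsearch validates mappings internally.

```python
def child_to_parent(relations):
    """Invert a join field's relations mapping into {child: parent}.

    Raises ValueError if a relation name is listed under more than one
    parent, which would violate the one-parent rule.
    """
    parents = {}
    for parent, children in relations.items():
        # A single child may be written as a plain string.
        if isinstance(children, str):
            children = [children]
        for child in children:
            if child in parents:
                raise ValueError(
                    f"{child!r} has more than one parent: "
                    f"{parents[child]!r} and {parent!r}"
                )
            parents[child] = parent
    return parents


print(child_to_parent({"question": ["answer", "comment"]}))
```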

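Searching with parent-join

The parent-join creates one field to index the name of the relation within the document (my_parent, my_child, …).

It also creates one field per parent/child relation. The name of this field is the name of the join field followed by # and the name of the parent in the relation. For instance, for the my_parent => [my_child, another_child] relation, the join field creates an additional field named my_join_field#my_parent.

If the document is a child (my_child or another_child), this field contains the _id of the parent it links to; if the document is a parent (my_parent), it contains the document's own _id.

When searching an index that contains a join field, these two fields are always returned in the search response: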
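
The naming and content rules for this extra field can be illustrated with a small sketch. Both function names are hypothetical, and the second one only covers a single-level relation, where a document is either a parent or a child.

```python
def parent_id_field(join_field, parent_relation):
    """Name of the extra field: join field name, '#', parent relation name."""
    return f"{join_field}#{parent_relation}"


def parent_id_field_value(doc_id, join_value):
    """Value described for that field in a single-level relation.

    A child carries its parent's _id; a parent carries its own _id.
    """
    if isinstance(join_value, dict) and "parent" in join_value:
        return join_value["parent"]
    return doc_id


print(parent_id_field("my_join_field", "question"))
print(parent_id_field_value("3", {"name": "answer", "parent": "1"}))
print(parent_id_field_value("1", "question"))
```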

GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "sort": ["_id"]
}

Response:

{
    ...,
    "hits": {
        "total": 4,
        "max_score": null,
        "hits": [
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "1",
                "_score": null,
                "_source": {
                    "text": "This is a question",
                    "my_join_field": "question" (1)
                },
                "sort": [
                    "1"
                ]
            },
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "2",
                "_score": null,
                "_source": {
                    "text": "This is another question",
                    "my_join_field": "question" 
                },
                "sort": [
                    "2"
                ]
            },
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "3",
                "_score": null,
                "_routing": "1",
                "_source": {
                    "text": "This is an answer",
                    "my_join_field": {
                        "name": "answer", (2)
                        "parent": "1"  (3)
                    }
                },
                "sort": [
                    "3"
                ]
            },
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "4",
                "_score": null,
                "_routing": "1",
                "_source": {
                    "text": "This is another answer",
                    "my_join_field": {
                        "name": "answer",
                        "parent": "1"
                    }
                },
                "sort": [
                    "4"
                ]
            }
        ]
    }
}

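(1) This document belongs to the question join.
(2) This document belongs to the answer join.
(3) The id of the parent document that this child document links to.

Parent-join queries and aggregations

See the has_child and has_parent queries, the children aggregation, and inner hits for more information.

The value of the join field is accessible in aggregations and scripts, and can be queried with the parent_id query: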
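
As a rough sketch of what the has_child and has_parent queries look like for the question/answer relation used in this article, the request bodies can be built as plain dictionaries. The match clauses on the text field are assumptions chosen for illustration; only the bodies are printed, and no request is sent.

```python
import json

# Parents (questions) that have at least one answer matching the inner query.
has_child_query = {
    "query": {
        "has_child": {
            "type": "answer",
            "query": {"match": {"text": "answer"}},
        }
    }
}

# Children (answers) whose parent question matches the inner query.
has_parent_query = {
    "query": {
        "has_parent": {
            "parent_type": "question",
            "query": {"match": {"text": "question"}},
        }
    }
}

for body in (has_child_query, has_parent_query):
    print(json.dumps(body, indent=2))
```

Either body would be sent to GET my_index/_search; both pay the join cost described in the performance section.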

GET my_index/_search
{
  "query": {
    "parent_id": { (1)
      "type": "answer",
      "id": "1"
    }
  },
  "aggs": {
    "parents": {
      "terms": {
        "field": "my_join_field#question", (2)
        "size": 10
      }
    }
  },
  "script_fields": {
    "parent": {
      "script": {
         "source": "doc['my_join_field#question']" (3)
      }
    }
  }
}

(1) Querying the parent id field.
(2) Aggregating on the parent id field (also see the children aggregation).
(3) Accessing the parent id field in a script.

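Global ordinals

The join field uses global ordinals to speed up joins. Global ordinals must be rebuilt after any change to a shard. The more parent id values a shard stores, the longer it takes to rebuild the global ordinals for the join field.

Global ordinals are built eagerly by default: if the index has changed, the global ordinals for the join field are rebuilt as part of the refresh, which can add significant time to the refresh. Most of the time this is the right trade-off, because otherwise the global ordinals are rebuilt when the first parent-join query or aggregation runs. That can cause a significant latency spike for your users, and it is usually worse, because when many writes occur, several rebuilds of the join field's global ordinals may be attempted within a single refresh interval.

When the join field is used infrequently and writes occur frequently, it may make sense to disable eager loading: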

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_join_field": {
          "type": "join",
          "relations": {
             "question": "answer"
          },
          "eager_global_ordinals": false
        }
      }
    }
  }
}

The amount of heap used by global ordinals can be checked per parent relation as follows:

# Per-index
GET _stats/fielddata?human&fields=my_join_field#question
# Per-node per-index
GET _nodes/stats/indices/fielddata?human&fields=my_join_field#question

Multiple children per parent

A single parent can have multiple children:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_join_field": {
          "type": "join",
          "relations": {
            "question": ["answer", "comment"]  
          }
        }
      }
    }
  }
}

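question is the parent of answer and comment.

Multiple levels of parent join

Using multiple levels of relations to replicate a relational model is not recommended. Each level of relation adds memory and computation overhead at query time. If you care about performance, denormalize your data instead.

Multiple levels of parent/child: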

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_join_field": {
          "type": "join",
          "relations": {
            "question": ["answer", "comment"],  (1)
            "answer": "vote" (2)
          }
        }
      }
    }
  }
}

(1) question is the parent of answer and comment.
(2) answer is the parent of vote.

The mapping above represents the following tree:

        question
        /     \
       /       \
    answer    comment
      |
      |
     vote

Indexing a grandchild document requires a routing value equal to that of the grandparent (the topmost ancestor of the lineage):

PUT my_index/_doc/3?routing=1&refresh 
{
  "text": "This is a vote",
  "my_join_field": {
    "name": "vote",
    "parent": "2" 
  }
}

This child document must be on the same shard as its parent and its grandparent.

The parent id of this document (must point to an answer document).
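
One way to pick the routing value for a document at any depth is to walk up the parent links until a document without a parent is reached. The sketch below assumes the client keeps a child-to-parent id map; routing_for and the ids q1, a1, and v1 are hypothetical.

```python
def routing_for(doc_id, parent_of):
    """Return the id of the topmost ancestor, used as the routing value.

    parent_of maps a document id to its parent's id; documents at the top
    of a lineage do not appear as keys.
    """
    seen = set()
    current = doc_id
    while current in parent_of:
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        current = parent_of[current]
    return current


# Hypothetical lineage: question q1 -> answer a1 -> vote v1
parent_of = {"a1": "q1", "v1": "a1"}
print(routing_for("v1", parent_of))
```

With this map, the vote, the answer, and the question all resolve to the same routing value, so the whole lineage lands on one shard.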

Translated from

https://www.elastic.co/guide/en/elasticsearch/reference/6.3/parent-join.html