Hot & Warm Architecture - Hot/Cold Data Separation in ES - Big Data Middleware

Background

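For log-style applications, a new index is typically created each day, and the current day's hot index serves a fair number of queries while it is still being written. If the cluster also holds older cold data, a user query over a long time span can consume enough disk I/O and CPU to slow down writes and delay incoming data. The remedy is to dedicate some machines to cold data. ES lets you give nodes custom attributes, so the cold nodes are tagged with "boxtype":"weak", and a nightly maintenance script updates the routing settings of the cold indices, index.routing.allocation.{require|include|exclude}, so that the data migrates to the cold nodes automatically.
Cold data is no longer written to and is queried infrequently, but its volume can be very large.
For example, suppose an index grows by 2 TB per day and the past 90 days must stay queryable at any time. Keeping that many indices open costs more than disk space: to access index files on disk quickly, ES keeps some data resident in memory (an index over the index files), known as segment memory.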

Solution

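Tag x SSD machines in the cluster as hot and y HDD machines as cold (this can be achieved by assigning arbitrary attributes to each server)
Hot nodes hold only the most recent two days of data
A scheduled job marks the previous day's index as cold every day
ES then migrates that index to the cold nodes automatically, based on the new tag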
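
To get a feel for how much data each tier holds under this scheme, here is a rough back-of-the-envelope sketch using the figures from the background section (2 TB per day, 90 days of retention, two days kept hot). The function name tier_split is made up for illustration, and the calculation counts primary data only, ignoring replicas and index overhead.

```shell
#!/bin/sh
# Rough split of retained data between hot and cold nodes (primaries only).
# Arguments: daily volume in TB, retention in days, days kept on hot nodes.
tier_split() {
  hot_tb=$(( $1 * $3 ))
  cold_tb=$(( $1 * ($2 - $3) ))
  printf 'hot=%sTB cold=%sTB\n' "$hot_tb" "$cold_tb"
}

tier_split 2 90 2   # prints: hot=4TB cold=176TB
```

The cold tier carries almost all of the volume, which is why it is placed on the cheaper HDD machines.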

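ES lets you distinguish hot and cold nodes by assigning attribute tags to each node.
awareness.attributes (full name cluster.routing.allocation.awareness.attributes) can be set in the configuration file or changed dynamically through the cluster-update-settings API.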

For instance, you could tag a hot node with node.attr.box_type: hot in elasticsearch.yml, or start the node with ./bin/elasticsearch -Enode.attr.box_type=hot
Nodes in the warm zone are tagged with node.attr.box_type: warm in elasticsearch.yml, or started with ./bin/elasticsearch -Enode.attr.box_type=warm
The box_type attribute can be set to any identifier that suits your needs, and you can use other attribute names such as zone. These tags tell ES how to allocate indices.
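
As a minimal sketch of the tagging step, the helper below appends the attribute line to a node's elasticsearch.yml. The function name tag_node is hypothetical, and the config path in the usage note is only the usual location for package installs, so adjust both to your environment.

```shell
#!/bin/sh
# Append a box_type attribute to a node's elasticsearch.yml.
# Arguments: tier name (e.g. hot, warm), path to the config file.
tag_node() {
  # Refuse to add a second, possibly conflicting attribute line.
  if grep -q '^node\.attr\.box_type:' "$2" 2>/dev/null; then
    echo "box_type already set in $2" >&2
    return 1
  fi
  printf 'node.attr.box_type: %s\n' "$1" >> "$2"
}
```

For example, tag_node hot /etc/elasticsearch/elasticsearch.yml followed by a node restart. After the restart, GET _cat/nodeattrs?v lists the attributes each node reports.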
Create the index with the following settings to make sure it is allocated to the SSD-backed hot nodes:

PUT /nginx
{
  "settings": {
    "index.routing.allocation.require.box_type": "hot"
  }
}

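Production rollout steps
1. Modify the Elasticsearch index template to add the setting index.routing.allocation.require.box_type: hot
2. Modify the ES configuration to add node.attr.box_type, set to hot on the hot nodes and cold on the cold nodes
3. Restart the Elasticsearch cluster. Newly created indices then go to the hot nodes by default, so a scheduled script is needed to change index.routing.allocation.require.box_type to cold on hot indices that have aged out.
A reference script: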

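#!/bin/bash
# Date suffix of the index from two days ago, formatted YYYY.MM.DD (requires GNU date)
DATE=$(date -d "-2day" +%Y.%m.%d)
# Require cold nodes and clear any exclude filter so the matching indices relocate
curl -H "Content-Type: application/json" -XPUT "http://localhost:9200/*${DATE}*/_settings" -d '{"index.routing.allocation.require.box_type":"cold","index.routing.allocation.exclude.box_type":""}'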
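
Step 1 does not show the template itself. The sketch below builds one possible request body, assuming the legacy template format of the 5.x releases this page refers to, where the index pattern goes in the "template" key (from 6.0 on, the key is "index_patterns" and takes an array). The function name hot_template_body and the nginx-* pattern are placeholders, not names from the original setup.

```shell
#!/bin/sh
# Build the body of a legacy index template that pins new indices to hot nodes.
# Argument: index name pattern the template should apply to.
hot_template_body() {
  printf '{"template":"%s","settings":{"index.routing.allocation.require.box_type":"hot"}}' "$1"
}

# Print the body for a hypothetical nginx-* pattern.
hot_template_body 'nginx-*'; echo
```

The output can be piped to curl, for example with -XPUT "http://localhost:9200/_template/<name>" -H "Content-Type: application/json" -d @-, where <name> is a template name of your choosing.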

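TIPS: With two machines (one hot, one cold), the replica count must be 0. With a replica count of 1, four machines are needed, because a primary shard and its replica cannot share a node and both must be placed in the same tier.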
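
The arithmetic behind this tip can be sketched as follows, assuming every tier has to fit a primary and all of its replicas on separate nodes. The function name min_nodes is made up for illustration.

```shell
#!/bin/sh
# Minimum node count when each tier must hold a primary plus all replicas
# on distinct nodes. Arguments: number of replicas, number of tiers.
min_nodes() {
  echo $(( ($1 + 1) * $2 ))
}

min_nodes 0 2   # prints 2
min_nodes 1 2   # prints 4
```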
