Big Data Systems - Three Key Configuration Parameters for Batch Processing in Storm - 《编程技术碎片》

Scenario
Parameters
Tuning
Example
References

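Scenario

When Storm is used to clean data and write it to a database, the number of database writes can be reduced by accumulating tuples up to a certain amount and then writing them in a single batch.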

Parameters

TOPOLOGY_MAX_SPOUT_PENDING

The maximum number of tuples that can be pending on a spout task at any given time. This config applies to individual tasks, not to spouts or topologies as a whole.

A pending tuple is one that has been emitted from a spout but has not been acked or failed yet. Note that this config parameter has no effect for unreliable spouts that don’t tag their tuples with a message id.

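The number of un-acked tuples allowed per spout task. In Storm's defaults.yaml the value is unset (null), which places no limit on pending tuples. The larger the value, the faster the topology must process tuples; otherwise timeouts become likely.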

TOPOLOGY_MESSAGE_TIMEOUT_SECS

The maximum amount of time given to the topology to fully process a message emitted by a spout. If the message is not acked within this time frame, Storm will fail the message on the spout. Some spouts implementations will then replay the message at a later time.

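The maximum time, in seconds, allowed for a tuple to be acked after the spout emits it; the default is 30. If the tuple is not acked in time, Storm fails it on the spout, and the spout may then re-emit it.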

TOPOLOGY_TICK_TUPLE_FREQ_SECS

How often a tick tuple from the "__system" component and "__tick" stream should be sent to tasks. Meant to be used as a component-specific configuration.

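The interval, in seconds, at which tick tuples are sent. Tick tuples are typically used to accumulate a number of tuples and then process them as one batch, for example by writing them to a database.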
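The accumulate-then-flush idea can be sketched without any Storm dependency. The class below is a hypothetical helper (the name TickBatchBuffer and its methods are not part of Storm): ordinary tuples go to add(), and the arrival of a tick tuple triggers onTick(), which hands everything buffered so far to a sink such as a batched database insert. In a real bolt, the buffered tuples would presumably be acked only after the batch write succeeds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: buffers items and hands them to a sink once per tick. */
final class TickBatchBuffer<T> {
    private final List<T> pending = new ArrayList<>();
    private final Consumer<List<T>> sink; // e.g. a batched database insert

    TickBatchBuffer(Consumer<List<T>> sink) {
        this.sink = sink;
    }

    /** Called for every ordinary tuple: just remember it. */
    void add(T item) {
        pending.add(item);
    }

    /** Called when a tick tuple arrives; returns how many items were flushed. */
    int onTick() {
        if (pending.isEmpty()) {
            return 0;
        }
        List<T> batch = new ArrayList<>(pending);
        sink.accept(batch); // the buffer is cleared only after the sink returns
        pending.clear();
        return batch.size();
    }
}
```

Because the buffer is cleared only after the sink returns, a failed write leaves the items in place for the next tick. The time spent inside the sink is what the tuning discussion calls the batch processing time.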

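Tuning

Let

SPOUT_TASK_NUM be the number of spout tasks
SPEED be the target processing rate, in tuples per second
BATCH_TACKLE_TIME be the time, in seconds, taken by one batch operation (for example, a database write)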

The parameters should then satisfy

TOPOLOGY_MESSAGE_TIMEOUT_SECS > TOPOLOGY_TICK_TUPLE_FREQ_SECS + BATCH_TACKLE_TIME, to prevent timeouts
TOPOLOGY_MAX_SPOUT_PENDING * SPOUT_TASK_NUM >= SPEED * (TOPOLOGY_TICK_TUPLE_FREQ_SECS + BATCH_TACKLE_TIME), so that each batch has enough tuples to process
BATCH_TACKLE_TIME is proportional to the number of tuples per batch, which in turn depends on TOPOLOGY_TICK_TUPLE_FREQ_SECS and TOPOLOGY_MAX_SPOUT_PENDING
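The two inequalities can be turned into lower bounds with a few lines of arithmetic. The sketch below assumes whole-second times and an integer tuple rate; the class and method names (BatchTuning, minMaxSpoutPending, minMessageTimeoutSecs) are hypothetical helpers for illustration, not Storm APIs.

```java
/** Hypothetical helper: lower bounds derived from the two tuning inequalities. */
final class BatchTuning {
    private BatchTuning() {}

    /**
     * Smallest per-task TOPOLOGY_MAX_SPOUT_PENDING such that
     * pending * spoutTaskNum >= speed * (tickFreqSecs + batchTackleSecs).
     */
    static long minMaxSpoutPending(int spoutTaskNum, long speed,
                                   int tickFreqSecs, int batchTackleSecs) {
        long inFlight = speed * (tickFreqSecs + batchTackleSecs);
        return (inFlight + spoutTaskNum - 1) / spoutTaskNum; // ceiling division
    }

    /**
     * Smallest whole-second TOPOLOGY_MESSAGE_TIMEOUT_SECS that is strictly
     * greater than tickFreqSecs + batchTackleSecs.
     */
    static int minMessageTimeoutSecs(int tickFreqSecs, int batchTackleSecs) {
        return tickFreqSecs + batchTackleSecs + 1;
    }

    public static void main(String[] args) {
        // 2 spout tasks, 500 tuples/s, 5 s tick interval, 3 s batch time
        System.out.println(minMaxSpoutPending(2, 500, 5, 3)); // prints 2000
        System.out.println(minMessageTimeoutSecs(5, 3));      // prints 9
    }
}
```

These are minimums only; since BATCH_TACKLE_TIME itself grows with the batch size, the result should be rechecked against a measured batch time.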

Example

Suppose

SPOUT_TASK_NUM = 1
SPEED = 1000
TOPOLOGY_TICK_TUPLE_FREQ_SECS = 4

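Assume BATCH_TACKLE_TIME is 4. Then TOPOLOGY_MAX_SPOUT_PENDING should be at least 1000 * (4 + 4) / 1 = 8000, and TOPOLOGY_MESSAGE_TIMEOUT_SECS must be greater than 4 + 4 = 8. Next, measure the actual BATCH_TACKLE_TIME; if it is larger than the assumed value, increase TOPOLOGY_TICK_TUPLE_FREQ_SECS and recompute the other two values.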
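As a rough illustration, the example values might be collected into a configuration map like this. The sketch uses a plain Map so that it runs without Storm on the classpath; the key strings are assumed to be the values behind the corresponding constants in Storm's Config class (which is itself a map). The class name ExampleTopologyConf and the timeout of 9 (the smallest whole number above 8) are illustrative choices, and a real deployment would likely want more headroom.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for the example values; a plain Map stands in for Storm's Config. */
final class ExampleTopologyConf {
    private ExampleTopologyConf() {}

    static Map<String, Object> build() {
        Map<String, Object> conf = new HashMap<>();
        // Flush interval: a tick tuple every 4 seconds
        conf.put("topology.tick.tuple.freq.secs", 4);
        // 1000 tuples/s * (4 s tick + 4 s assumed batch time) / 1 spout task
        conf.put("topology.max.spout.pending", 8000);
        // Must be strictly greater than 4 + 4 = 8 seconds
        conf.put("topology.message.timeout.secs", 9);
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

Since the tick frequency is meant to be component-specific, in a real topology it would more likely be returned from the batching bolt's getComponentConfiguration() than set for the whole topology.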

References

Running Topologies on a Production Cluster
Tick tuples within Storm
Apache Storm Design Pattern—Micro Batching