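Scenario
Storm is used to clean data and write it to a database. To reduce the number of database writes, tuples can be accumulated up to a certain amount and then written to the database in one batch.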
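The accumulate-then-write idea can be sketched independently of Storm. The class below is a hypothetical helper (the name `MicroBatchBuffer` and the `sink` callback are assumptions for illustration, not Storm APIs): records are collected as they arrive and handed to the sink in a single call when a flush is requested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Storm: collects cleaned records and
// hands them to a sink (for example a batched database insert) in one call.
final class MicroBatchBuffer<T> {
    private final List<T> pending = new ArrayList<>();
    private final Consumer<List<T>> sink;

    MicroBatchBuffer(Consumer<List<T>> sink) {
        this.sink = sink;
    }

    // Called for every ordinary record: only remember it, do not write yet.
    void add(T record) {
        pending.add(record);
    }

    // Called when it is time to write: passes everything accumulated so far
    // to the sink and returns the number of records written.
    int flush() {
        if (pending.isEmpty()) {
            return 0;
        }
        List<T> batch = new ArrayList<>(pending);
        sink.accept(batch);   // if the sink throws, the records stay buffered
        pending.clear();
        return batch.size();
    }

    int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        List<List<String>> writes = new ArrayList<>();
        MicroBatchBuffer<String> buffer = new MicroBatchBuffer<>(writes::add);
        buffer.add("a");
        buffer.add("b");
        buffer.add("c");
        int flushed = buffer.flush();
        System.out.println(flushed + " records written in " + writes.size() + " call(s)");
    }
}
```

In a bolt, `add` would presumably be called for ordinary tuples and `flush` when a tick tuple arrives, with the buffered tuples acked only after the write succeeds. That delay before acking is why the timeout settings discussed in this post matter.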
Parameters
TOPOLOGY_MAX_SPOUT_PENDING
The maximum number of tuples that can be pending on a spout task at any given time. This config applies to individual tasks, not to spouts or topologies as a whole.
A pending tuple is one that has been emitted from a spout but has not been acked or failed yet. Note that this config parameter has no effect for unreliable spouts that don’t tag their tuples with a message id.
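The maximum number of un-acked tuples allowed per spout task. It is unset (null) by default, which means no limit. The larger the value, the faster downstream processing has to be; otherwise tuples are likely to time out.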
TOPOLOGY_MESSAGE_TIMEOUT_SECS
The maximum amount of time given to the topology to fully process a message emitted by a spout. If the message is not acked within this time frame, Storm will fail the message on the spout. Some spouts implementations will then replay the message at a later time.
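The maximum time (in seconds) allowed between a tuple being emitted and being acked; the default is 30. If a tuple is not acked in time, Storm fails it on the spout, and a spout that supports replay will emit it again.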
TOPOLOGY_TICK_TUPLE_FREQ_SECS
How often a tick tuple from the "__system" component and "__tick" stream should be sent to tasks. Meant to be used as a component-specific configuration.
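The interval (in seconds) between tick tuples. Tick tuples are typically used for this kind of scenario: accumulate a certain number of tuples, then process them as a batch, for example by writing them to a database.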
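As a rough sketch of where these three settings live, the snippet below builds plain maps keyed by the strings behind the `Config` constants. Storm's `Config` is itself a `Map<String, Object>`, so ordinary `HashMap`s stand in for it here to keep the example free of the Storm dependency. The class and method names are hypothetical, and the values in `main` are placeholders rather than recommendations.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for the settings discussed above; plain maps stand in
// for org.apache.storm.Config so the sketch needs no Storm dependency.
final class TopologySettings {
    // Key strings behind Config.TOPOLOGY_MAX_SPOUT_PENDING and friends.
    static final String MAX_SPOUT_PENDING = "topology.max.spout.pending";
    static final String MESSAGE_TIMEOUT_SECS = "topology.message.timeout.secs";
    static final String TICK_TUPLE_FREQ_SECS = "topology.tick.tuple.freq.secs";

    // Topology-level settings, passed along when the topology is submitted.
    static Map<String, Object> topologyConf(int maxSpoutPending, int timeoutSecs) {
        Map<String, Object> conf = new HashMap<>();
        conf.put(MAX_SPOUT_PENDING, maxSpoutPending);
        conf.put(MESSAGE_TIMEOUT_SECS, timeoutSecs);
        return conf;
    }

    // Component-level setting: the tick frequency is meant to be configured
    // per component, e.g. returned by the batching bolt's
    // getComponentConfiguration() method.
    static Map<String, Object> batchingBoltConf(int tickSecs) {
        Map<String, Object> conf = new HashMap<>();
        conf.put(TICK_TUPLE_FREQ_SECS, tickSecs);
        return conf;
    }

    public static void main(String[] args) {
        // Placeholder values, for illustration only.
        System.out.println(topologyConf(5000, 30));
        System.out.println(batchingBoltConf(5));
    }
}
```

Keeping the tick frequency in the bolt's own configuration means that only the bolt that does the batching receives tick tuples.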
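Tuning
Let
- SPOUT_TASK_NUM be the number of spout tasks
- SPEED be the target processing speed (tuples per second)
- BATCH_TACKLE_TIME be the time (in seconds) taken by one batch operation (e.g. writing to the database)
The parameters must satisfy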
- TOPOLOGY_MESSAGE_TIMEOUT_SECS > TOPOLOGY_TICK_TUPLE_FREQ_SECS + BATCH_TACKLE_TIME, to prevent timeouts
- TOPOLOGY_MAX_SPOUT_PENDING * SPOUT_TASK_NUM >= SPEED * (TOPOLOGY_TICK_TUPLE_FREQ_SECS + BATCH_TACKLE_TIME), so that each batch has enough tuples to process
- BATCH_TACKLE_TIME is proportional to the number of tuples in one batch (which in turn depends on TOPOLOGY_TICK_TUPLE_FREQ_SECS and TOPOLOGY_MAX_SPOUT_PENDING)
Example
Assume
- SPOUT_TASK_NUM = 1
- SPEED = 1000
- TOPOLOGY_TICK_TUPLE_FREQ_SECS = 4
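Taking BATCH_TACKLE_TIME to be 4, TOPOLOGY_MAX_SPOUT_PENDING should be at least 1000 * (4 + 4) / 1 = 8000, and TOPOLOGY_MESSAGE_TIMEOUT_SECS must be greater than 4 + 4 = 8. The next step is to measure the actual BATCH_TACKLE_TIME; if it is larger than the assumed value, increase TOPOLOGY_TICK_TUPLE_FREQ_SECS and recompute the other two values.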
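The arithmetic in this example can be checked with a small calculator. This is only a sketch of the two inequalities from the tuning section; the class and method names are hypothetical, and BATCH_TACKLE_TIME still has to be measured on the real system.

```java
// Hypothetical calculator for the two tuning inequalities; not a Storm API.
final class BatchTuning {
    // Smallest per-task TOPOLOGY_MAX_SPOUT_PENDING such that
    // pending * spoutTasks >= speed * (tickSecs + batchSecs).
    static long minMaxSpoutPending(int spoutTasks, long speed, int tickSecs, int batchSecs) {
        long needed = speed * (tickSecs + batchSecs);
        return (needed + spoutTasks - 1) / spoutTasks; // ceiling division
    }

    // TOPOLOGY_MESSAGE_TIMEOUT_SECS must be strictly greater than this sum.
    static int timeoutLowerBound(int tickSecs, int batchSecs) {
        return tickSecs + batchSecs;
    }

    // True if the chosen settings satisfy both inequalities.
    static boolean satisfies(long maxSpoutPending, int timeoutSecs,
                             int spoutTasks, long speed, int tickSecs, int batchSecs) {
        boolean noTimeout = timeoutSecs > tickSecs + batchSecs;
        boolean enoughTuples = maxSpoutPending * spoutTasks >= speed * (tickSecs + batchSecs);
        return noTimeout && enoughTuples;
    }

    public static void main(String[] args) {
        // Values from the example: 1 spout task, 1000 tuples/s, tick every 4 s,
        // assumed batch time 4 s.
        System.out.println(minMaxSpoutPending(1, 1000, 4, 4)); // prints 8000
        System.out.println(timeoutLowerBound(4, 4));           // prints 8
        System.out.println(satisfies(8000, 9, 1, 1000, 4, 4)); // prints true
    }
}
```

Note that a timeout of exactly 8 does not pass the check, because the first inequality is strict.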
References
Running Topologies on a Production Cluster
Tick tuples within Storm
Apache Storm Design Pattern—Micro Batching