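Scenario

When using Storm to clean data and write it to a database, you can reduce the number of database writes by accumulating a certain amount of data first and then writing it in a single batch.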

Parameters

TOPOLOGY_MAX_SPOUT_PENDING

The maximum number of tuples that can be pending on a spout task at any given time. This config applies to individual tasks, not to spouts or topologies as a whole.

A pending tuple is one that has been emitted from a spout but has not been acked or failed yet. Note that this config parameter has no effect for unreliable spouts that don’t tag their tuples with a message id.

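The number of un-acked tuples allowed per spout task. Storm's defaults.yaml leaves it unset (null), which means no limit for a regular topology. The larger the value, the faster the topology has to process tuples; otherwise tuples are likely to time out.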

TOPOLOGY_MESSAGE_TIMEOUT_SECS

The maximum amount of time given to the topology to fully process a message emitted by a spout. If the message is not acked within this time frame, Storm will fail the message on the spout. Some spouts implementations will then replay the message at a later time.

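The maximum time (in seconds) a tuple may take to be acked after the spout emits it. The default is 30. If the tuple is not acked in time, Storm fails it on the spout, which typically makes the spout emit it again.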

TOPOLOGY_TICK_TUPLE_FREQ_SECS

How often a tick tuple from the “__system” component and “__tick” stream should be sent to tasks. Meant to be used as a component-specific configuration.

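The interval (in seconds) between tick tuples. Tick tuples are typically used for cases like this one: accumulate tuples for a while, then process them as a batch, for example by writing them to a database.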
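The accumulate-then-flush pattern can be sketched without any Storm dependency. The class below is a hypothetical helper, not part of Storm: TickBatcher, onData, onTick and the maxBatchSize cap are names and choices made up for illustration. It buffers ordinary items and hands them to a sink (standing in for a batched database write) when a tick arrives or when the cap is reached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers items and passes them to a sink in batches (hypothetical helper, not a Storm class). */
final class TickBatcher<T> {
    private final int maxBatchSize;        // assumed safety cap so the buffer cannot grow without bound
    private final Consumer<List<T>> sink;  // stands in for a batched database write
    private List<T> buffer = new ArrayList<>();

    TickBatcher(int maxBatchSize, Consumer<List<T>> sink) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
    }

    /** Call for every ordinary tuple: buffer it and flush early if the cap is reached. */
    void onData(T item) {
        buffer.add(item);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Call when a tick tuple arrives: flush whatever has accumulated so far. */
    void onTick() {
        flush();
    }

    /** Number of items currently waiting in the buffer. */
    int pending() {
        return buffer.size();
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return; // nothing to write, skip the sink call
        }
        List<T> batch = buffer;
        buffer = new ArrayList<>();
        sink.accept(batch);
    }
}
```

Inside a real bolt, execute() would presumably call onTick() for tuples whose source component is "__system" and whose stream is "__tick", call onData() for everything else, and ack the buffered tuples only after the sink has succeeded.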

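Tuning

  • SPOUT_TASK_NUM is the number of spout tasks
  • SPEED is the target processing speed, in tuples per second
  • BATCH_TACKLE_TIME is the time, in seconds, that one batch operation (for example, a database write) takes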

The parameters must satisfy

  • TOPOLOGY_MESSAGE_TIMEOUT_SECS > TOPOLOGY_TICK_TUPLE_FREQ_SECS + BATCH_TACKLE_TIME, to prevent timeouts
  • TOPOLOGY_MAX_SPOUT_PENDING * SPOUT_TASK_NUM >= SPEED * (TOPOLOGY_TICK_TUPLE_FREQ_SECS + BATCH_TACKLE_TIME), so that each batch can contain enough tuples
  • BATCH_TACKLE_TIME is proportional to the number of tuples in a batch, so it depends on TOPOLOGY_TICK_TUPLE_FREQ_SECS and TOPOLOGY_MAX_SPOUT_PENDING
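The two inequalities can be checked mechanically before deploying a configuration. The following is a minimal sketch: TuningCheck and violations are hypothetical names, and the arguments are plain numbers supplied by the caller rather than values read from a Storm Config.

```java
import java.util.ArrayList;
import java.util.List;

/** Checks a candidate configuration against the two inequalities (hypothetical helper). */
final class TuningCheck {
    /** Returns one message per violated constraint; an empty list means both hold. */
    static List<String> violations(long maxSpoutPending, int spoutTaskNum, double speed,
                                   double messageTimeoutSecs, double tickFreqSecs,
                                   double batchTackleTime) {
        List<String> out = new ArrayList<>();
        // Worst-case time a tuple waits: one full tick interval plus one batch write.
        double window = tickFreqSecs + batchTackleTime;

        // Constraint 1: the timeout must be strictly greater than the window.
        if (!(messageTimeoutSecs > window)) {
            out.add("TOPOLOGY_MESSAGE_TIMEOUT_SECS must exceed " + window);
        }
        // Constraint 2: total in-flight capacity must cover SPEED * window tuples.
        if (!((double) maxSpoutPending * spoutTaskNum >= speed * window)) {
            out.add("TOPOLOGY_MAX_SPOUT_PENDING * SPOUT_TASK_NUM must be at least " + speed * window);
        }
        return out;
    }
}
```

For instance, violations(8000, 1, 1000, 9, 4, 4) returns an empty list, while lowering the timeout to 8 reports the first constraint because 8 is not strictly greater than 4 + 4.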

Example

Assume

  • SPOUT_TASK_NUM = 1
  • SPEED = 1000
  • TOPOLOGY_TICK_TUPLE_FREQ_SECS = 4

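Start by assuming BATCH_TACKLE_TIME is 4. Then TOPOLOGY_MAX_SPOUT_PENDING must be at least 1000 * (4 + 4) = 8000, and TOPOLOGY_MESSAGE_TIMEOUT_SECS must be greater than 8. Next, measure the actual BATCH_TACKLE_TIME. If it is larger than the assumed value, increase TOPOLOGY_TICK_TUPLE_FREQ_SECS and recompute the other two parameters.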
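The arithmetic of this example can be reproduced with a small calculator. This is only a sketch: TuningExample and its method names are hypothetical, all inputs are whole numbers, and the measured batch time of 6 seconds in the second call is an invented value used purely to show how the results shift.

```java
/** Derives minimum parameter values from the constraints (hypothetical helper). */
final class TuningExample {
    /** Smallest per-task pending count with pending * tasks >= speed * (tick + batch). */
    static long minMaxSpoutPending(int spoutTaskNum, long speed,
                                   long tickFreqSecs, long batchTackleTime) {
        long inFlight = speed * (tickFreqSecs + batchTackleTime);
        return (inFlight + spoutTaskNum - 1) / spoutTaskNum; // ceiling division
    }

    /** Smallest whole-second timeout strictly greater than tick + batch. */
    static long minMessageTimeoutSecs(long tickFreqSecs, long batchTackleTime) {
        return tickFreqSecs + batchTackleTime + 1;
    }

    public static void main(String[] args) {
        // Values from the example: 1 spout task, 1000 tuples/s, tick every 4 s, batch time 4 s.
        System.out.println(minMaxSpoutPending(1, 1000, 4, 4)); // prints 8000
        System.out.println(minMessageTimeoutSecs(4, 4));       // prints 9

        // Invented measurement: the batch write actually takes 6 s instead of 4 s.
        System.out.println(minMaxSpoutPending(1, 1000, 4, 6)); // prints 10000
        System.out.println(minMessageTimeoutSecs(4, 6));       // prints 11
    }
}
```

These are lower bounds only; in practice some headroom above them is probably sensible, since BATCH_TACKLE_TIME itself grows with the batch size.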

References

Running Topologies on a Production Cluster
Tick tuples within Storm
Apache Storm Design Pattern—Micro Batching