What is the “task” in Storm parallelism
question
I’m trying to learn Twitter Storm by following the great article “Understanding the parallelism of a Storm topology”.
However, I’m a bit confused by the concept of a “task”. Is a task a running instance of a component (spout or bolt)? Does an executor having multiple tasks mean that the same component is executed multiple times by that executor? Am I correct?
Moreover, in a general parallelism sense, Storm will spawn a dedicated thread (executor) for a spout or bolt, but what does an executor (thread) having multiple tasks contribute to the parallelism? Since a thread executes sequentially, I think having multiple tasks in a thread only makes the thread a kind of “cached” resource, which avoids spawning a new thread for the next task run. Am I correct?
---
answer
However, I’m a bit confused by the concept of a “task”. Is a task a running instance of a component (spout or bolt)? Does an executor having multiple tasks mean that the same component is executed multiple times by that executor? Am I correct?
Yes, and yes.
Moreover, in a general parallelism sense, Storm will spawn a dedicated thread (executor) for a spout or bolt, but what does an executor (thread) having multiple tasks contribute to the parallelism?
Running more than one task per executor does not increase the level of parallelism: an executor always has exactly one thread that it uses for all of its tasks, which means that tasks run serially on an executor.
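To make the "one thread, many tasks" point concrete, here is a minimal sketch in plain Python. It does not use Storm at all: the `Executor` class and the task names are hypothetical stand-ins, and a real executor dispatches incoming tuples to its tasks rather than simply looping over them once. The only thing it illustrates is that every task of an executor runs on the same thread, one after the other.

```python
import threading


class Executor:
    """Hypothetical stand-in for a Storm executor: one thread, several tasks."""

    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.log = []  # (task name, thread id) in the order the tasks ran

    def _run(self):
        # The single executor thread visits its tasks one by one.
        for task in self.tasks:
            self.log.append((task, threading.get_ident()))

    def start_and_wait(self):
        thread = threading.Thread(target=self._run)
        thread.start()
        thread.join()
        return self.log


def run_demo():
    # Three tasks of the same (hypothetical) bolt assigned to one executor.
    executor = Executor(["bolt-task-0", "bolt-task-1", "bolt-task-2"])
    return executor.start_and_wait()
```

Calling `run_demo()` returns the tasks in their original order, all tagged with the same thread id, so adding tasks to this executor adds work for the thread but no extra concurrency.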
As I wrote in the article, please note that:
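- The number of executor threads can be changed after the topology has been started (see the storm rebalance command). In other words, the executor count can be adjusted dynamically.
- The number of tasks of a topology is static: it cannot be changed once the topology has started. A simple way to understand this: the fields grouping strategy in stream grouping routes tuples by hashing the grouping field modulo the number of tasks. If the number of tasks changed, a tuple could be routed to a different task than before.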
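The hash-modulo argument for a fixed task count can be sketched as follows. This is an illustration under assumptions, not Storm's code: `route` and `moved_keys` are hypothetical helpers, and `zlib.crc32` is used only as a stable stand-in for whatever hash function Storm applies to the grouping fields.

```python
import zlib


def route(field_value: str, num_tasks: int) -> int:
    """Pick a target task index for a key, fields-grouping style."""
    # crc32 is an assumption made for this sketch; it is deterministic
    # across runs, which keeps the example reproducible.
    return zlib.crc32(field_value.encode("utf-8")) % num_tasks


def moved_keys(keys, old_num_tasks, new_num_tasks):
    """Return the keys whose target task changes when the task count changes."""
    return [
        key
        for key in keys
        if route(key, old_num_tasks) != route(key, new_num_tasks)
    ]


keys = [f"user-{i}" for i in range(100)]
unchanged = moved_keys(keys, 4, 4)  # same task count: nothing moves
changed = moved_keys(keys, 4, 5)    # different task count: many keys move
```

With the task count held at 4, every key keeps its target task. Going from 4 to 5 tasks sends a large share of the keys to a different task, which would break any per-key state that a bolt task has accumulated.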
And by definition there is the invariant of #executors <= #tasks.
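Here is a rough sketch of how a fixed set of tasks could be spread over a changing number of executors, and why the invariant falls out naturally. `assign_tasks` is a hypothetical helper written for this answer. Storm's actual scheduler may distribute tasks differently, and raising an error when there are too many executors is a choice made for the sketch, not a description of Storm's behavior.

```python
def assign_tasks(num_tasks: int, num_executors: int):
    """Split task ids 0..num_tasks-1 into contiguous groups, one per executor."""
    if not 1 <= num_executors <= num_tasks:
        # An executor without a task would have nothing to run,
        # hence #executors <= #tasks.
        raise ValueError("need 1 <= num_executors <= num_tasks")

    base, extra = divmod(num_tasks, num_executors)
    assignment = []
    next_task = 0
    for i in range(num_executors):
        # The first `extra` executors take one additional task each.
        size = base + (1 if i < extra else 0)
        assignment.append(list(range(next_task, next_task + size)))
        next_task += size
    return assignment


before = assign_tasks(4, 2)  # [[0, 1], [2, 3]]: two tasks per executor
after = assign_tasks(4, 4)   # [[0], [1], [2], [3]]: one task per executor
```

The task ids are the same before and after; only their grouping into threads changes. That is the sense in which extra tasks give a topology room to scale out later with a rebalance, even though they add no parallelism while they share an executor.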