Job Fault Tolerance

Failure options

  • Finish Current Running will finish the jobs that are currently running, but it will not start new jobs. The flow will be put in the FAILED FINISHING state and be set to FAILED once everything completes.
  • Cancel All will immediately kill all running jobs and set the state of the executing flow to FAILED.
  • Finish All Possible will keep executing jobs in the flow as long as its dependencies are met. The flow will be put in the FAILED FINISHING state and be set to FAILED once everything completes.

Azkaban Operations Manual - Figure 1

Finish Current Running

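When a job fails, no new jobs are started. Jobs that are already running are left alone and allowed to finish. The flow state moves from FAILED FINISHING to FAILED.
Consider the following flow:
Dependencies: jobC depends on jobB1, jobB2, and jobB3; jobB1, jobB2, and jobB3 each depend on jobA.
During execution, jobB1 throws an exception, so jobC is cancelled and its status becomes CANCELLED.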
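
The dependency graph in this scenario can be written down as Azkaban `.job` files. The following is a minimal sketch, assuming the classic property-file job format with `type=command` and a comma-separated `dependencies` key. The `echo` commands and the helper name `render_job_files` are illustrative placeholders, not part of the original flow.

```python
# Dependency graph from the scenario: job name -> jobs it depends on.
FLOW = {
    "jobA": [],
    "jobB1": ["jobA"],
    "jobB2": ["jobA"],
    "jobB3": ["jobA"],
    "jobC": ["jobB1", "jobB2", "jobB3"],
}


def render_job_files(flow):
    """Return {file name: file content} for each job in the flow.

    The echo command is a placeholder; a real job would do actual work.
    """
    files = {}
    for name, deps in flow.items():
        lines = ["type=command", f"command=echo running {name}"]
        if deps:
            # Azkaban reads dependencies as a comma-separated list of job names.
            lines.append("dependencies=" + ",".join(deps))
        files[f"{name}.job"] = "\n".join(lines) + "\n"
    return files


if __name__ == "__main__":
    for filename, content in render_job_files(FLOW).items():
        print(f"# {filename}\n{content}")
```

Zipping the generated files and uploading the archive to a project would, under these assumptions, reproduce the flow used in the scenarios in this section.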

Cancel All

Immediately kills all running jobs and sets the flow state to FAILED.
Scenario: the same flow as above, executed again.
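This time jobB1 throws an exception and is marked FAILED, jobC is cancelled, and jobB3, which was still running, is killed. From jobB3's point of view, it is killed midway only because jobB1 failed. This is undesirable and can easily leave state or data inconsistent.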

Finish All Possible

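Runs as much of the flow as possible, as long as each job's dependencies are satisfied.
Scenario: the same flow as above, executed again.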
Analysis: jobB1 throws an exception and is marked FAILED, while jobB2 and jobB3 keep running. jobC is cancelled because one of its dependencies, jobB1, failed.
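
To compare the three failure options side by side, the behaviour described above can be modelled with a small simulation. This is only a sketch under simplifying assumptions: every job that is allowed to run is assumed to succeed, and the option names (`finishCurrent`, `cancelImmediately`, `finishPossible`) are used here purely as labels. The function `simulate_failure` is hypothetical and is not Azkaban code.

```python
# Dependency graph from the scenario: job name -> jobs it depends on.
FLOW = {
    "jobA": [],
    "jobB1": ["jobA"],
    "jobB2": ["jobA"],
    "jobB3": ["jobA"],
    "jobC": ["jobB1", "jobB2", "jobB3"],
}


def simulate_failure(flow, succeeded, failed, running, option):
    """Return the final status of every job after `failed` has failed.

    succeeded: jobs that had already finished successfully
    running:   jobs still running at the moment of the failure
    option:    "finishCurrent", "cancelImmediately" or "finishPossible"
    Simplification: any job that is allowed to run is assumed to succeed.
    """
    status = {job: "SUCCEEDED" for job in succeeded}
    status[failed] = "FAILED"
    for job in running:
        # Only Cancel All kills jobs that are already running.
        status[job] = "KILLED" if option == "cancelImmediately" else "SUCCEEDED"

    pending = [job for job in flow if job not in status]
    if option == "finishPossible":
        # Keep starting jobs whose dependencies have all succeeded.
        progressed = True
        while progressed:
            progressed = False
            for job in list(pending):
                if all(status.get(dep) == "SUCCEEDED" for dep in flow[job]):
                    status[job] = "SUCCEEDED"
                    pending.remove(job)
                    progressed = True

    # Whatever never started is cancelled.
    for job in pending:
        status[job] = "CANCELLED"
    return status


if __name__ == "__main__":
    for option in ("finishCurrent", "cancelImmediately", "finishPossible"):
        result = simulate_failure(FLOW, {"jobA"}, "jobB1", {"jobB2"}, option)
        print(option, result)
```

In this run jobB3 has not started when jobB1 fails, which is where the options differ: in this model it is cancelled under Finish Current Running and Cancel All, but still runs under Finish All Possible. jobC is cancelled in all three cases because jobB1 failed.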

Concurrent Options

  • Skip Execution option will not run the flow if it's already running.
  • Run Concurrently option will run the flow regardless of whether it's running. Executions are given different working directories.
  • Pipeline runs the flow in a manner that the new execution will not overrun the concurrent execution.
    • Level 1: blocks executing job A until the previous flow’s job A has completed.
    • Level 2: blocks executing job A until the children of the previous flow’s job A have completed. This is useful if you need to run your flows a few steps behind an already executing flow.

Azkaban Operations Manual - Figure 7
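Skip Execution: if the flow is already running, a new execution request is skipped.
Run Concurrently: if the flow is already running, the new execution runs alongside it.
Pipeline: the new execution is held back so that it does not overtake the running one.
Level 1: the current execution's jobA starts only after the previous execution's jobA has completed.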
Level 2: the current execution's jobA starts only after the children of the previous execution's jobA have completed.
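
These options can also be set when a flow is triggered through Azkaban's AJAX API rather than the web UI. The sketch below only builds the query string and sends nothing. It assumes the `executeFlow` call accepts `failureAction` (`finishCurrent`, `cancelImmediately`, `finishPossible`) and `concurrentOption` (`skip`, `ignore` for Run Concurrently, `pipeline`). The `pipelineLevel` parameter name is an assumption that should be checked against the Azkaban version in use, and `build_execute_flow_query` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Values assumed to be accepted by the executeFlow AJAX call; verify per version.
FAILURE_ACTIONS = {"finishCurrent", "cancelImmediately", "finishPossible"}
CONCURRENT_OPTIONS = {"skip", "ignore", "pipeline"}


def build_execute_flow_query(session_id, project, flow,
                             failure_action="finishCurrent",
                             concurrent_option="skip",
                             pipeline_level=None):
    """Build the path and query string for an executeFlow request."""
    if failure_action not in FAILURE_ACTIONS:
        raise ValueError(f"unknown failure action: {failure_action}")
    if concurrent_option not in CONCURRENT_OPTIONS:
        raise ValueError(f"unknown concurrent option: {concurrent_option}")

    params = {
        "ajax": "executeFlow",
        "session.id": session_id,
        "project": project,
        "flow": flow,
        "failureAction": failure_action,
        "concurrentOption": concurrent_option,
    }
    if concurrent_option == "pipeline":
        # Only Level 1 and Level 2 are described in this document.
        if pipeline_level not in (1, 2):
            raise ValueError("pipeline_level must be 1 or 2")
        params["pipelineLevel"] = pipeline_level  # assumed parameter name
    return "/executor?" + urlencode(params)


if __name__ == "__main__":
    # "demo" and "jobC" are placeholder project and flow names.
    print(build_execute_flow_query("SESSION", "demo", "jobC",
                                   failure_action="finishPossible",
                                   concurrent_option="pipeline",
                                   pipeline_level=1))
```

Validating the option values before sending the request keeps a typo from silently falling back to whatever default the server applies.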

Manual Job Intervention

  • Cancel - kills all running jobs and fails the flow immediately. The flow state will be KILLED.
  • Pause - prevents new jobs from running. Currently running jobs proceed as usual.
  • Resume - resume a paused execution.
  • Retry Failed - only available when the flow is in a FAILED FINISHING state. Retry will restart all FAILED jobs while the flow is still active. Attempts will appear in the Jobs List page.
  • Prepare Execution - only available on a finished flow, regardless of success or failures. This will auto disable successfully completed jobs.

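Prepare Execution: once a flow has finished, the Prepare Execution button appears.
If the previous execution ended with some jobs succeeded and some failed, the jobs that succeeded are skipped on the next execution.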
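Cancel: while the flow is running, the Cancel button appears. (Possibly because of a version difference, the installation used here shows a Kill button instead.)
Scenario: jobB1 is running when Kill is clicked; jobB1 becomes KILLED and jobC becomes CANCELLED.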
Azkaban Operations Manual - Figure 10
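After Kill is clicked, running jobs become KILLED and jobs that have not started become CANCELLED.
If the flow is then executed again, the jobs that succeeded in the previous execution are skipped.
Scenario: jobB1 and jobC are executed again; jobA, jobB2, and jobB3 are skipped.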
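
The selection rule behind this re-run is simple: disable every job that succeeded last time and run the rest. The sketch below derives that list from the statuses in the scenario. The JSON-array form mirrors what the `disabled` parameter of the execute API is assumed to expect, and `jobs_to_disable` is a hypothetical helper, not an Azkaban function.

```python
import json


def jobs_to_disable(previous_status):
    """Return the jobs that succeeded in the previous execution, sorted by name."""
    return sorted(job for job, state in previous_status.items()
                  if state == "SUCCEEDED")


# Statuses after the Kill scenario described above.
previous = {
    "jobA": "SUCCEEDED",
    "jobB1": "KILLED",
    "jobB2": "SUCCEEDED",
    "jobB3": "SUCCEEDED",
    "jobC": "CANCELLED",
}

disabled = jobs_to_disable(previous)
print(json.dumps(disabled))  # ["jobA", "jobB2", "jobB3"]
```

Only jobB1 and jobC remain enabled, which matches the re-run described in the scenario.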
Retry Failed: when the flow state is FAILED_FINISHING (the flow is still finishing, but some job has failed), the Retry Failed button appears. Clicking it retries the failed jobs.
Scenario: jobB2 has failed; clicking Retry Failed reruns jobB2.
Pause: while the flow is running and no job has failed, the Pause button appears.
Resume: resumes a paused flow.
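
The conditions under which each button appears can be summarised as a small lookup. This sketch encodes only the rules stated in this section; the real UI may show additional buttons in some states, and `available_actions` is a hypothetical helper rather than part of Azkaban.

```python
# Flow states treated as "finished" for the purposes of this sketch.
FINISHED_STATES = {"SUCCEEDED", "FAILED", "KILLED"}


def available_actions(flow_state, has_failed_job=False):
    """Return the manual actions this section describes for a flow state."""
    if flow_state == "RUNNING":
        actions = ["Cancel"]
        if not has_failed_job:
            # Pause is offered only while no job in the flow has failed.
            actions.append("Pause")
        return actions
    if flow_state == "PAUSED":
        return ["Resume"]
    if flow_state == "FAILED_FINISHING":
        return ["Retry Failed"]
    if flow_state in FINISHED_STATES:
        return ["Prepare Execution"]
    return []


if __name__ == "__main__":
    for state in ("RUNNING", "PAUSED", "FAILED_FINISHING", "FAILED"):
        print(state, available_actions(state))
```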