An operator represents a single, ideally idempotent, task. Operators determine what actually executes when your DAG runs.
:::info
🔖 Note:
See the Operators Concepts documentation and the Operators API Reference for more information.
:::
1. BashOperator
Use the [BashOperator](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/bash/index.html#airflow.operators.bash.BashOperator) to execute commands in a Bash shell.
📑 airflow/example_dags/example_bash_operator.py
run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='echo 1',
)
1.1 Templating
You can use Jinja templates to parameterize the bash_command argument.
📑 airflow/example_dags/example_bash_operator.py
also_run_this = BashOperator(
    task_id='also_run_this',
    bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
)
:::info
⚠ Warning:
Take care with “user” input and with Jinja templates in the bash_command, because BashOperator does not escape or sanitize the command.
This applies mostly to “dag_run” conf, which users can submit through the Web UI. Most of the default template variables are not at risk.
:::
For example, do not do this:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: \'{{ dag_run.conf["message"] if dag_run else "" }}\'"',
)
Instead, you should pass this via the env kwarg and use double-quotes inside the bash_command, as below:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "here is the message: \'$message\'"',
    env={'message': '{{ dag_run.conf["message"] if dag_run else "" }}'},
)
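
To see why the env approach is safer, the sketch below reproduces both patterns outside Airflow with plain `subprocess` and `bash -c`. It is only an illustration: the helper names and the payload are hypothetical, and Python string interpolation stands in for Jinja rendering.

```python
import os
import subprocess


def run_interpolated(message: str) -> str:
    """Unsafe: paste the untrusted text into the command string itself."""
    command = f'echo "Here is the message: \'{message}\'"'
    result = subprocess.run(["bash", "-c", command], capture_output=True, text=True)
    return result.stdout


def run_with_env(message: str) -> str:
    """Safer: the command text is fixed; the untrusted text travels as an env var."""
    command = 'echo "here is the message: \'$message\'"'
    env = {**os.environ, "message": message}
    result = subprocess.run(
        ["bash", "-c", command], capture_output=True, text=True, env=env
    )
    return result.stdout


# Hypothetical hostile value for dag_run.conf["message"].
payload = '"; echo INJECTED; echo "'

if __name__ == "__main__":
    print(run_interpolated(payload))  # the shell runs the injected echo
    print(run_with_env(payload))      # the payload is printed as plain text
```

In the first helper the payload closes the double-quoted string and smuggles in a second command. In the second, the shell expands `$message` inside double quotes, so the value is never parsed as shell syntax.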
1.2 Troubleshooting
Jinja template not found
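Add a space after the script name when directly calling a Bash script with the bash_command argument. Without the space, Airflow treats a bash_command value that ends in .sh or .bash as the path to a Jinja template file and tries to load it, which fails when no such template can be found.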
t2 = BashOperator(
    task_id='bash_example',
    # This fails with 'Jinja template not found' error
    # bash_command="/home/batcher/test.sh",
    # This works (has a space after)
    bash_command="/home/batcher/test.sh ",
    dag=dag,
)
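
The effect of the trailing space can be modelled with a simple suffix check. This is a simplified sketch, not Airflow's actual code: the function name is hypothetical, and the extension tuple assumes the operator's template extensions are `.sh` and `.bash`.

```python
# Extensions assumed to mark a bash_command value as a template file.
TEMPLATE_EXT = (".sh", ".bash")


def looks_like_template_file(bash_command: str) -> bool:
    """Return True if the value would be looked up as a Jinja template file."""
    return bash_command.endswith(TEMPLATE_EXT)


if __name__ == "__main__":
    print(looks_like_template_file("/home/batcher/test.sh"))   # treated as a template path
    print(looks_like_template_file("/home/batcher/test.sh "))  # treated as a plain command
```

Because the check looks only at the end of the string, a single trailing space is enough to make the value an ordinary command.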
However, if you want to use templating in your Bash script, do not add the space. Instead, put the script in a location relative to the directory containing the DAG file. If your DAG file is /usr/local/airflow/dags/test_dag.py, you can move test.sh to any location under /usr/local/airflow/dags/ (for example, /usr/local/airflow/dags/scripts/test.sh) and pass the relative path to bash_command as shown below:
t2 = BashOperator(
    task_id='bash_example',
    # "scripts" folder is under "/usr/local/airflow/dags"
    bash_command="scripts/test.sh",
    dag=dag,
)
Creating a separate folder for Bash scripts may be desirable for many reasons, such as separating script logic from pipeline code, getting proper syntax highlighting in files written in different languages, and keeping flexibility in how pipelines are structured.
You can also set template_searchpath in the DAG constructor call to point to any folder location.
Example:
dag = DAG("example_bash_dag", template_searchpath="/opt/scripts")
t2 = BashOperator(
    task_id='bash_example',
    # "test.sh" is a template file under "/opt/scripts" (no trailing space, so it is looked up there)
    bash_command="test.sh",
    dag=dag,
)
2. PythonOperator
Use the [PythonOperator](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/python/index.html#airflow.operators.python.PythonOperator) to execute Python callables.
📑 airflow/example_dags/example_python_operator.py
from pprint import pprint
def print_context(ds, **kwargs):
    """Print the Airflow context and ds variable from the context."""
    pprint(kwargs)
    print(ds)
    return 'Whatever you return gets printed in the logs'
run_this = PythonOperator(
    task_id='print_the_context',
    python_callable=print_context,
)
2.1 Passing in arguments
Use the op_args and op_kwargs arguments to pass additional arguments to the Python callable.
📑 airflow/example_dags/example_python_operator.py
import time
def my_sleeping_function(random_base):
    """This is a function that will run within the DAG execution"""
    time.sleep(random_base)
# Generate 5 sleeping tasks, sleeping from 0.0 to 0.4 seconds respectively
for i in range(5):
    task = PythonOperator(
        task_id='sleep_for_' + str(i),
        python_callable=my_sleeping_function,
        op_kwargs={'random_base': float(i) / 10},
    )
    run_this >> task
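
Conceptually, op_args and op_kwargs are unpacked into the callable as positional and keyword arguments. The sketch below imitates that call in plain Python. `call_like_python_operator` and `greet` are hypothetical names used only for illustration, not part of Airflow.

```python
def greet(name, punctuation="!"):
    """Toy callable standing in for a task function."""
    return f"hello {name}{punctuation}"


def call_like_python_operator(python_callable, op_args=None, op_kwargs=None):
    """Hypothetical stand-in for the call the operator makes at run time."""
    return python_callable(*(op_args or []), **(op_kwargs or {}))


if __name__ == "__main__":
    print(call_like_python_operator(greet, op_args=["airflow"]))
    print(call_like_python_operator(greet, op_kwargs={"name": "airflow", "punctuation": "?"}))
```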
2.2 Templating
Airflow passes in an additional set of keyword arguments: one for each of the Jinja template variables, plus a templates_dict argument.
The templates_dict argument is templated, so each value in the dictionary is evaluated as a Jinja template.
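
As a rough illustration of this keyword-argument passing, the plain-Python sketch below calls a callable with a hand-built context dictionary. The dictionary contents are made-up sample values rather than what Airflow actually supplies, and in a real run the templates_dict values would already have been rendered by Jinja before the call.

```python
def print_context(ds, **kwargs):
    """Bind ds by name; every other context entry lands in kwargs."""
    return ds, sorted(kwargs)


# Hypothetical context: one entry per template variable, plus templates_dict.
context = {
    "ds": "2021-01-01",
    "run_id": "manual__2021-01-01",
    "templates_dict": {"report_date": "2021-01-01"},
}

if __name__ == "__main__":
    ds, other_keys = print_context(**context)
    print(ds)
    print(other_keys)
```

Naming a parameter after a template variable (here ds) is enough to receive it directly. Anything not named is still reachable through kwargs.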
3. PythonVirtualenvOperator
Use the [PythonVirtualenvOperator](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/python/index.html#airflow.operators.python.PythonVirtualenvOperator) to execute Python callables inside a new Python virtual environment.
📑 airflow/example_dags/example_python_operator.py
def callable_virtualenv():
    """
    Example function that will be performed in a virtual environment.
    Importing at the function level ensures that it will not attempt to import the
    library before it is installed.
    """
    from time import sleep
    from colorama import Back, Fore, Style
    print(Fore.RED + 'some red text')
    print(Back.GREEN + 'and with a green background')
    print(Style.DIM + 'and in dim text')
    print(Style.RESET_ALL)
    for _ in range(10):
        print(Style.DIM + 'Please wait...', flush=True)
        sleep(10)
    print('Finished')
virtualenv_task = PythonVirtualenvOperator(
    task_id="virtualenv_python",
    python_callable=callable_virtualenv,
    requirements=["colorama==0.4.0"],
    system_site_packages=False,
)
3.1 Passing in arguments
You can use the op_args and op_kwargs arguments the same way you use them in the PythonOperator. Unfortunately, serializing var and ti / task_instance is currently not supported due to incompatibilities with the underlying library. For Airflow context variables, make sure that you either have access to Airflow by setting system_site_packages to True or add apache-airflow to the requirements argument. Otherwise you won’t have access to most of the Airflow context variables in op_kwargs. If you want the context related to datetime objects like execution_date, you can add pendulum and lazy_object_proxy to the requirements.
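
Because the callable runs in a separate interpreter, its arguments have to be serialized to get there. The sketch below uses the standard pickle module to check whether a value survives that round trip. It is an assumption-laden illustration: the helper name is hypothetical, and the lock is only a stand-in for an object holding live resources, not a claim about why ti cannot be serialized.

```python
import pickle
import threading


def can_cross_process_boundary(value) -> bool:
    """Return True if the value survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(value))
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


if __name__ == "__main__":
    print(can_cross_process_boundary({"random_base": 0.1}))       # plain data
    print(can_cross_process_boundary({"lock": threading.Lock()}))  # live resource
```

A check like this on your own op_kwargs can catch unserializable values before the task runs.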
3.2 Templating
You can use Jinja templating the same way you use it in the PythonOperator.
4. Cross-DAG Dependencies
When two DAGs have dependency relationships, it is worth considering combining them into a single DAG, which is usually simpler to understand. Airflow also offers better visual representation of dependencies for tasks on the same DAG. However, it is sometimes not practical to put all related tasks on the same DAG. For example:
- Two DAGs may have different schedules. E.g. a weekly DAG may have tasks that depend on other tasks on a daily DAG.
- Different teams are responsible for different DAGs, but these DAGs have some cross-DAG dependencies.
- A task may depend on another task on the same DAG, but for a different execution_date.
ExternalTaskSensor can be used to establish such dependencies across different DAGs. When it is used together with ExternalTaskMarker, clearing dependent tasks can also happen across different DAGs.
4.1 ExternalTaskSensor
Use the [ExternalTaskSensor](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/sensors/external_task/index.html#airflow.sensors.external_task.ExternalTaskSensor) to make tasks on a DAG wait for another task on a different DAG for a specific execution_date.
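
When the two DAGs run on different schedules, the execution_date to wait for usually differs from the sensor's own. The sketch below models that lookup as simple date arithmetic. It assumes the sensor's execution_delta parameter, which this page does not otherwise cover, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def external_execution_date(
    execution_date: datetime, execution_delta: Optional[timedelta] = None
) -> datetime:
    """Date looked up on the other DAG in this simplified model."""
    if execution_delta is None:
        # No offset: wait for the run with the same execution_date.
        return execution_date
    return execution_date - execution_delta


if __name__ == "__main__":
    own_date = datetime(2021, 1, 10)
    print(external_execution_date(own_date))
    print(external_execution_date(own_date, timedelta(days=1)))
```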
ExternalTaskSensor also provides options to specify which states of the task on the remote DAG count as success or failure, via the allowed_states and failed_states parameters.
📑 airflow/example_dags/example_external_task_marker_dag.py
child_task1 = ExternalTaskSensor(
    task_id="child_task1",
    external_dag_id=parent_dag.dag_id,
    external_task_id=parent_task.task_id,
    timeout=600,
    allowed_states=['success'],
    failed_states=['failed', 'skipped'],
    mode="reschedule",
)
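
The way allowed_states and failed_states interact can be pictured as a three-way decision on each check. The sketch below is a toy model rather than the sensor's real implementation: `check_external_state` is a hypothetical name, and RuntimeError stands in for whatever exception Airflow raises.

```python
def check_external_state(external_state, allowed_states=("success",), failed_states=()):
    """Return True when done, False to keep waiting, or raise on failure."""
    if external_state in failed_states:
        # The remote task ended in a state we treat as failure: stop waiting.
        raise RuntimeError(f"external task is in state {external_state!r}")
    return external_state in allowed_states


if __name__ == "__main__":
    print(check_external_state("success"))
    print(check_external_state("running", failed_states=("failed", "skipped")))
```

With failed_states left empty, a failed remote task simply never matches, so the sensor would keep waiting until its timeout. Listing failure states lets it fail fast instead.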
4.2 ExternalTaskMarker
If you want child_task1 on child_dag to be cleared for a specific execution_date whenever parent_task on parent_dag is cleared, use ExternalTaskMarker. Note that child_task1 will only be cleared if “Recursive” is selected when the user clears parent_task.
📑 airflow/example_dags/example_external_task_marker_dag.py
parent_task = ExternalTaskMarker(
    task_id="parent_task",
    external_dag_id="example_external_task_marker_child",
    external_task_id="child_task1",
)
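
The role of the “Recursive” option can be sketched as following marker links from one DAG to the next. This is a toy model under stated assumptions: the MARKERS mapping, the "parent_dag" id, and `tasks_to_clear` are hypothetical and do not reflect how Airflow stores or resolves markers internally.

```python
from typing import Dict, Set, Tuple

Task = Tuple[str, str]  # (dag_id, task_id)

# Hypothetical marker links: the key is a marker task, the value is the task it points at.
MARKERS: Dict[Task, Task] = {
    ("parent_dag", "parent_task"): ("example_external_task_marker_child", "child_task1"),
}


def tasks_to_clear(start: Task, markers: Dict[Task, Task], recursive: bool) -> Set[Task]:
    """Return the tasks a clear request would touch in this toy model."""
    cleared = {start}
    current = start
    # Follow marker links only when the user asked for a recursive clear.
    while recursive and current in markers and markers[current] not in cleared:
        current = markers[current]
        cleared.add(current)
    return cleared


if __name__ == "__main__":
    parent = ("parent_dag", "parent_task")
    print(tasks_to_clear(parent, MARKERS, recursive=False))
    print(tasks_to_clear(parent, MARKERS, recursive=True))
```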