🚀 Original article: https://airflow.apache.org/docs/apache-airflow/stable/modules_management.html

Airflow allows you to use your own Python modules in the DAG and in the Airflow configuration. The following article will describe how you can create your own module so that Airflow can load it correctly, as well as diagnose problems when modules are not loaded properly.

1. Packages Loading in Python

The list of directories from which Python tries to load modules is given by the variable sys.path. Python tries to determine the contents of this variable intelligently, depending on the operating system, how Python is installed, and which Python version is used.

You can check the contents of this variable for the current Python environment by running an interactive terminal as in the example below:

```python
>>> import sys
>>> from pprint import pprint
>>> pprint(sys.path)
['',
 '/home/arch/.pyenv/versions/3.7.4/lib/python37.zip',
 '/home/arch/.pyenv/versions/3.7.4/lib/python3.7',
 '/home/arch/.pyenv/versions/3.7.4/lib/python3.7/lib-dynload',
 '/home/arch/venvs/airflow/lib/python3.7/site-packages']
```

sys.path is initialized during program startup. The first precedence is given to the current directory, i.e., path[0] is the directory containing the script that was used to invoke Python, or an empty string in the case of an interactive shell. Second precedence is given to PYTHONPATH if provided, followed by installation-dependent default paths, which are managed by the site module.

sys.path can also be modified during a Python session simply by using append (for example, sys.path.append("/path/to/custom/package")). Python will start searching for packages in the new paths once they are added. Airflow makes use of this feature as described in the section Additional modules in Airflow.
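The append pattern can be demonstrated end-to-end; in this sketch the directory and module names (my_custom_mod) are hypothetical, created only for the demo:

```python
import pathlib
import sys
import tempfile

# Create a throwaway directory containing a module named "my_custom_mod"
# (both names are hypothetical, used only for this demonstration).
tmp_dir = tempfile.mkdtemp()
pathlib.Path(tmp_dir, "my_custom_mod.py").write_text("VALUE = 42\n")

# Append the directory to sys.path; Python can now find the module there.
sys.path.append(tmp_dir)
import my_custom_mod

print(my_custom_mod.VALUE)  # 42
```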

sys.path also contains the site-packages directory, which holds installed external packages; this means you can install packages with pip or anaconda and use them in Airflow. In the next section, you will learn how to create your own simple installable package and how to specify additional directories to be added to sys.path using the environment variable PYTHONPATH.

2. Creating a package in Python

1. Before starting, install the following packages:
   • setuptools: a package development process library designed for creating and distributing Python packages.
   • wheel: provides a bdist_wheel command for setuptools. It creates a .whl file which is directly installable through the pip install command. We can then upload the same file to PyPI.

```bash
$ pip install --upgrade pip setuptools wheel
```
2. Create the package directory; in our case, we will call it airflow_operators:

```bash
$ mkdir airflow_operators
```
3. Create the file __init__.py inside the package and add the following code:

```python
print("Hello from airflow_operators")
```

When we import this package, it should print the above message.

4. Create setup.py:

```python
import setuptools

setuptools.setup(
    name="airflow_operators",
)
```

5. Build the wheel:

```bash
$ python setup.py bdist_wheel
```
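In a real package you would usually pass more metadata to setup(); a minimal sketch, where the version number, find_packages() usage, and python_requires value are assumptions, not part of the original example:

```python
import setuptools

setuptools.setup(
    name="airflow_operators",
    version="0.0.1",                      # hypothetical version number
    packages=setuptools.find_packages(),  # picks up the airflow_operators package
    python_requires=">=3.6",
)
```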

This will create a few directories in the project, and the overall structure will look like the following:

```
.
├── airflow_operators
│   └── __init__.py
├── airflow_operators.egg-info
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   └── top_level.txt
├── build
│   └── bdist.macosx-10.15-x86_64
├── dist
│   └── airflow_operators-0.0.0-py3-none-any.whl
└── setup.py
```
6. Install the .whl file using pip:

```bash
$ pip install dist/airflow_operators-0.0.0-py3-none-any.whl
```

7. The package is now ready to use:

```python
>>> import airflow_operators
Hello from airflow_operators
```

The package can be removed using pip:

```bash
$ pip uninstall airflow_operators
```

For more details on how to create and publish Python packages, see Packaging Python Projects.

3. Adding directories to the path

You can specify additional directories to be added to sys.path using the environment variable PYTHONPATH. Start the Python shell, providing the path to the root of your project, using the following command:

```bash
PYTHONPATH=/home/arch/projects/airflow_operators python
```

The sys.path variable will look like below:

```python
>>> import sys
>>> from pprint import pprint
>>> pprint(sys.path)
['',
 '/home/arch/projects/airflow_operators',
 '/home/arch/.pyenv/versions/3.7.4/lib/python37.zip',
 '/home/arch/.pyenv/versions/3.7.4/lib/python3.7',
 '/home/arch/.pyenv/versions/3.7.4/lib/python3.7/lib-dynload',
 '/home/arch/venvs/airflow/lib/python3.7/site-packages']
```

As we can see, our provided directory is now added to the path. Let's try to import the package now:

```python
>>> import airflow_operators
Hello from airflow_operators
```

We can also use the PYTHONPATH variable with airflow commands. For example, if we run the following airflow command:

```bash
PYTHONPATH=/home/arch/projects/airflow_operators airflow info
```

We'll see the Python PATH updated with our mentioned PYTHONPATH value as shown below:

```
Python PATH: [/home/arch/venv/bin:/home/arch/projects/airflow_operators:/usr/lib/python38.zip:/usr/lib/python3.8:/usr/lib/python3.8/lib-dynload:/home/arch/venv/lib/python3.8/site-packages:/home/arch/airflow/dags:/home/arch/airflow/config:/home/arch/airflow/plugins]
```
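The same effect can be checked programmatically by launching a fresh interpreter with PYTHONPATH set (reusing the example project path from above):

```python
import os
import subprocess
import sys

# Run a child interpreter with PYTHONPATH pointing at our project directory
# and ask it to print its sys.path.
env = dict(os.environ, PYTHONPATH="/home/arch/projects/airflow_operators")
out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.path)"],
    env=env,
    capture_output=True,
    text=True,
).stdout

# The PYTHONPATH entry appears in the child's sys.path; note that the
# directory need not actually exist for Python to list it.
print("/home/arch/projects/airflow_operators" in out)  # True
```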

4. Additional modules in Airflow

Airflow adds three additional directories to sys.path:

  • DAGs folder: configured with the dags_folder option in the [core] section.
  • Config folder: {AIRFLOW_HOME}/config by default; the location follows the AIRFLOW_HOME variable.
  • Plugins folder: configured with the plugins_folder option in the [core] section.
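The DAGs and plugins locations can be pinned explicitly in airflow.cfg; a minimal sketch, where the paths shown are assumptions:

```ini
[core]
dags_folder = /home/arch/airflow/dags
plugins_folder = /home/arch/airflow/plugins
```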

You can also see the exact paths using the airflow info command, and use them similarly to directories specified with the environment variable PYTHONPATH. An example of the sys.path contents reported by this command:

```
Python PATH: [/home/rootcss/venvs/airflow/bin:/usr/lib/python38.zip:/usr/lib/python3.8:/usr/lib/python3.8/lib-dynload:/home/rootcss/venvs/airflow/lib/python3.8/site-packages:/home/rootcss/airflow/dags:/home/rootcss/airflow/config:/home/rootcss/airflow/plugins]
```

Below is sample output of the airflow info command (see also "When are plugins (re)loaded?" in the Airflow documentation):

```
$ airflow info

Apache Airflow: 2.0.2

System info
OS                   | Linux
architecture         | x86_64
uname                | uname_result(system='Linux', node='wd-ws-prod-env-yumingmin-ubuntu18-gpu-649c5857fd-nxbx7', release='5.8.14-1.el7.elrepo.x86_64', version='#1 SMP Mon Oct 5 19:02:22 EDT 2020', machine='x86_64', processor='x86_64')
locale               | ('en_US', 'UTF-8')
python_version       | 3.6.10 | packaged by conda-forge | (default, Apr 24 2020, 16:44:11) [GCC 7.3.0]
python_location      | /opt/conda/bin/python3.6

Tools info
git                  | git version 2.17.1
ssh                  | OpenSSH_7.6p1 Ubuntu-4ubuntu0.3, OpenSSL 1.0.2n 7 Dec 2017
kubectl              | NOT AVAILABLE
gcloud               | NOT AVAILABLE
cloud_sql_proxy      | NOT AVAILABLE
mysql                | mysql Ver 14.14 Distrib 5.7.33, for Linux (x86_64) using EditLine wrapper
sqlite3              | 3.34.0 2020-12-01 16:14:00 a26b6597e3ae272231b96f9982c3bcc17ddec2f2b6eb4df06a224b91089fed5b
psql                 | NOT AVAILABLE

Paths info
airflow_home         | /opt/workspace/yumingmin_shared_drive/bigdata/airflow
system_path          | /opt/conda/bin:/opt/conda/condabin:/opt/code-server:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
python_path          | /opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages:/opt/workspace/yumingmin_shared_drive/bigdata/airflow/dags:/opt/workspace/yumingmin_shared_drive/bigdata/airflow/config:/opt/workspace/yumingmin_shared_drive/bigdata/airflow/plugins
airflow_on_path      | True

Config info
executor             | SequentialExecutor
task_logging_handler | airflow.utils.log.file_task_handler.FileTaskHandler
sql_alchemy_conn     | mysql://root:root@localhost:3306/airflow?unix_socket=/var/run/mysqld/mysqld.sock
dags_folder          | /opt/workspace/yumingmin_shared_drive/bigdata/airflow/dags
plugins_folder       | /opt/workspace/yumingmin_shared_drive/bigdata/airflow/plugins
base_log_folder      | /opt/workspace/yumingmin_shared_drive/bigdata/airflow/logs

Providers info
apache-airflow-providers-amazon            | 1.4.0
apache-airflow-providers-apache-cassandra  | 1.0.1
apache-airflow-providers-apache-druid      | 1.1.0
apache-airflow-providers-apache-hdfs       | 1.0.1
apache-airflow-providers-apache-hive       | 1.0.3
apache-airflow-providers-apache-kylin      | 1.0.1
apache-airflow-providers-apache-livy       | 1.1.0
apache-airflow-providers-apache-pig        | 1.0.1
apache-airflow-providers-apache-pinot      | 1.0.1
apache-airflow-providers-apache-spark      | 1.0.3
apache-airflow-providers-apache-sqoop      | 1.0.1
apache-airflow-providers-celery            | 1.0.1
apache-airflow-providers-cloudant          | 1.0.1
apache-airflow-providers-cncf-kubernetes   | 1.2.0
apache-airflow-providers-databricks        | 1.0.1
apache-airflow-providers-datadog           | 1.0.1
apache-airflow-providers-dingding          | 1.0.2
apache-airflow-providers-discord           | 1.0.1
apache-airflow-providers-docker            | 1.2.0
apache-airflow-providers-elasticsearch     | 1.0.4
apache-airflow-providers-exasol            | 1.1.1
apache-airflow-providers-facebook          | 1.1.0
apache-airflow-providers-ftp               | 1.1.0
apache-airflow-providers-grpc              | 1.1.0
apache-airflow-providers-hashicorp         | 1.0.2
apache-airflow-providers-http              | 1.1.1
apache-airflow-providers-imap              | 1.0.1
apache-airflow-providers-jdbc              | 1.0.1
apache-airflow-providers-jenkins           | 1.1.0
apache-airflow-providers-jira              | 1.0.2
apache-airflow-providers-microsoft-azure   | 2.0.0
apache-airflow-providers-microsoft-mssql   | 1.1.0
apache-airflow-providers-microsoft-winrm   | 1.2.0
apache-airflow-providers-mongo             | 1.0.1
apache-airflow-providers-mysql             | 1.1.0
apache-airflow-providers-odbc              | 1.0.1
apache-airflow-providers-openfaas          | 1.1.1
apache-airflow-providers-opsgenie          | 1.0.2
apache-airflow-providers-oracle            | 1.1.0
apache-airflow-providers-pagerduty         | 1.0.1
apache-airflow-providers-papermill         | 1.0.2
apache-airflow-providers-plexus            | 1.0.1
apache-airflow-providers-postgres          | 1.0.2
apache-airflow-providers-presto            | 1.0.2
apache-airflow-providers-qubole            | 1.0.2
apache-airflow-providers-redis             | 1.0.1
apache-airflow-providers-salesforce        | 2.0.0
apache-airflow-providers-samba             | 1.0.1
apache-airflow-providers-segment           | 1.0.1
apache-airflow-providers-sendgrid          | 1.0.2
apache-airflow-providers-sftp              | 1.2.0
apache-airflow-providers-singularity       | 1.1.0
apache-airflow-providers-slack             | 3.0.0
apache-airflow-providers-snowflake         | 1.3.0
apache-airflow-providers-sqlite            | 1.0.2
apache-airflow-providers-ssh               | 1.3.0
apache-airflow-providers-tableau           | 1.0.0
apache-airflow-providers-telegram          | 1.0.2
apache-airflow-providers-vertica           | 1.0.1
apache-airflow-providers-yandex            | 1.0.1
apache-airflow-providers-zendesk           | 1.0.1
```