airflow.providers.apache.beam.operators.beam¶

此模块包含 Apache Beam 操作符。

类¶

`BeamDataflowMixin`	用于存储通用 Dataflow 特定逻辑的辅助类。
`BeamBasePipelineOperator`	Beam 流水线操作符的抽象基类。
`BeamRunPythonPipelineOperator`	启动用 Python 编写的 Apache Beam 流水线。
`BeamRunJavaPipelineOperator`	启动用 Java 编写的 Apache Beam 流水线。
`BeamRunGoPipelineOperator`	启动用 Go 编写的 Apache Beam 流水线。

模块内容¶

class airflow.providers.apache.beam.operators.beam.BeamDataflowMixin[source]¶

用于存储通用 Dataflow 特定逻辑的辅助类。

BeamRunPythonPipelineOperator, BeamRunJavaPipelineOperator 和 BeamRunGoPipelineOperator。

dataflow_hook: airflow.providers.google.cloud.hooks.dataflow.DataflowHook | None[source]¶

dataflow_config: airflow.providers.google.cloud.operators.dataflow.DataflowConfiguration[source]¶

gcp_conn_id: str[source]¶

dataflow_support_impersonation: bool = True[source]¶

class airflow.providers.apache.beam.operators.beam.BeamBasePipelineOperator(*, runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', dataflow_config=None, **kwargs)[source]¶

基类: airflow.models.BaseOperator, BeamDataflowMixin, abc.ABC

Beam 流水线操作符的抽象基类。

参数:

runner (str) – 流水线将运行的 Runner。默认使用“DirectRunner”。其他可能的选项：DataflowRunner, SparkRunner, FlinkRunner, PortableRunner。参见: BeamRunnerType 参见: https://beam.apache.org/documentation/runners/capability-matrix/
default_pipeline_options (dict | None) – 默认流水线选项的映射。
pipeline_options (dict | None) –
流水线选项的映射。键是字典中的键。值可以包含不同的类型
- 如果值为 None，则将添加单个选项 - --key（不带值）。
- 如果值为 False，则跳过此选项
- 如果值为 True，则将添加单个选项 - --key（不带值）。
- 如果值为 list，则会为每个键添加多个选项。如果值为 ['A', 'B'] 且键为 key，则会添加 --key=A --key=B 选项
- 其他值类型将替换为 Python 的文本表示形式。
定义标签（labels 选项）时，也可以提供一个字典。
gcp_conn_id (str) – 可选。连接到 Google Cloud Storage 时使用的连接 ID，如果 Python 文件位于 GCS 上。
dataflow_config (airflow.providers.google.cloud.operators.dataflow.DataflowConfiguration | dict | None) – Dataflow 的配置，当 runner 类型设置为 DataflowRunner 时使用，(可选) 默认为 None。

runner = 'DirectRunner'[source]¶

default_pipeline_options[source]¶

pipeline_options[source]¶

dataflow_config[source]¶

gcp_conn_id = 'google_cloud_default'[source]¶

beam_hook: airflow.providers.apache.beam.hooks.beam.BeamHook[source]¶

dataflow_hook: airflow.providers.google.cloud.hooks.dataflow.DataflowHook | None = None[source]¶

property dataflow_job_id[source]¶

execute_complete(context, event)[source]¶

触发器触发时执行 - 立即返回。

依赖触发器抛出异常，否则假定执行成功。

class airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator(*, py_file, runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, py_interpreter='python3', py_options=None, py_requirements=None, py_system_site_packages=False, gcp_conn_id='google_cloud_default', dataflow_config=None, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]¶

基类: BeamBasePipelineOperator

启动用 Python 编写的 Apache Beam 流水线。

请注意，default_pipeline_options 和 pipeline_options 都将合并以指定流水线执行参数，default_pipeline_options 预期用于保存高级选项，例如项目和区域信息，这些选项适用于 DAG 中的所有 Beam 操作符。

另请参阅

有关如何使用此操作符的更多信息，请参阅指南: 在 Apache Beam 中运行 Python 流水线

另请参阅

有关 Apache Beam 的更多详细信息，请参阅参考: https://beam.apache.org/documentation/

参数:

py_file (str) – 对 Apache Beam Python 流水线文件 .py 的引用，例如，/some/local/file/path/to/your/python/pipeline/file。(模板化)
py_options (list[str] | None) – 其他 Python 选项，例如，["-m", "-v"]。
py_interpreter (str) – Beam 流水线的 Python 版本。如果为 None，默认为 python3。要跟踪 Beam 支持的 Python 版本和相关问题，请查看: https://issues.apache.org/jira/browse/BEAM-1251
py_requirements (list[str] | None) –
要安装的其他 Python 包。如果为此参数传递了值，将创建一个新的虚拟环境并安装额外的包。

如果您的系统上未安装 apache_beam 包或您想使用不同的版本，您也可以安装它。
py_system_site_packages (bool) – 是否在您的 virtualenv 中包含 system_site_packages。有关更多信息，请参阅 virtualenv 文档。此选项仅在 py_requirements 参数不为 None 时相关。
deferrable (bool) – 在可推迟模式下运行操作符：使用异步调用检查状态。

template_fields: collections.abc.Sequence[str] = ('py_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config')[source]¶

template_fields_renderers[source]¶

operator_extra_links[source]¶

py_file[source]¶

py_options = [][source]¶

py_interpreter = 'python3'[source]¶

py_requirements = None[source]¶

py_system_site_packages = False[source]¶

deferrable = True[source]¶

execute(context)[source]¶

执行 Apache Beam Python 流水线。

execute_on_dataflow(context)[source]¶

在 Dataflow runner 上执行 Apache Beam 流水线。

on_kill()[source]¶

覆盖此方法以在任务实例被终止时清理子进程。

在操作符内部使用 threading、subprocess 或 multiprocessing 模块时，需要进行清理，否则会留下僵尸进程。

class airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator(*, jar, runner='DirectRunner', job_class=None, default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', dataflow_config=None, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]¶

基类: BeamBasePipelineOperator

启动用 Java 编写的 Apache Beam 流水线。

请注意，default_pipeline_options 和 pipeline_options 都将合并以指定流水线执行参数，default_pipeline_options 预期用于保存高级 pipeline_options，例如项目和区域信息，这些选项适用于 DAG 中的所有 Apache Beam 操作符。

另请参阅

有关如何使用此操作符的更多信息，请参阅指南: 在 Apache Beam 中运行 Java 流水线

另请参阅

有关 Apache Beam 的更多详细信息，请参阅参考: https://beam.apache.org/documentation/

您需要使用 jar 参数将 jar 文件路径作为文件引用传递，该 jar 需要是自执行 jar（参见此处文档: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar）。使用 pipeline_options 将 pipeline_options 传递给您的作业。

参数:

jar (str) – 对自执行 Apache Beam jar 的引用（模板化）。
job_class (str | None) – 要执行的 Apache Beam 流水线类的名称，这通常不是流水线 jar 文件中配置的主类。

template_fields: collections.abc.Sequence[str] = ('jar', 'runner', 'job_class', 'pipeline_options', 'default_pipeline_options', 'dataflow_config')[source]¶

template_fields_renderers[source]¶

ui_color = '#0273d4'[source]¶

operator_extra_links[source]¶

jar[source]¶

job_class = None[source]¶

deferrable = True[source]¶

execute(context)[source]¶

执行 Apache Beam Python 流水线。

execute_on_dataflow(context)[source]¶

在 Dataflow runner 上执行 Apache Beam 流水线。

on_kill()[source]¶

覆盖此方法以在任务实例被终止时清理子进程。

在操作符内部使用 threading、subprocess 或 multiprocessing 模块时，需要进行清理，否则会留下僵尸进程。

class airflow.providers.apache.beam.operators.beam.BeamRunGoPipelineOperator(*, go_file='', launcher_binary='', worker_binary='', runner='DirectRunner', default_pipeline_options=None, pipeline_options=None, gcp_conn_id='google_cloud_default', dataflow_config=None, **kwargs)[source]¶

基类: BeamBasePipelineOperator

启动用 Go 编写的 Apache Beam 流水线。

请注意，default_pipeline_options 和 pipeline_options 都将合并以指定流水线执行参数，default_pipeline_options 预期用于保存高级选项，例如项目和区域信息，这些选项适用于 DAG 中的所有 Beam 操作符。

另请参阅

有关如何使用此运算符的更多信息，请参阅指南：在 Apache Beam 中运行 Go 管线

另请参阅

有关 Apache Beam 的更多详细信息，请参阅参考: https://beam.apache.org/documentation/

参数:

go_file (str) – 指向 Apache Beam 管线 Go 源代码文件的引用，例如 /local/path/to/main.go 或 gs://bucket/path/to/main.go。必须且只能提供 go_file 和 launcher_binary 其中之一。
launcher_binary (str) – 指向针对启动平台编译的 Apache Beam 管线 Go 二进制文件的引用，例如 /local/path/to/launcher-main 或 gs://bucket/path/to/launcher-main。必须且只能提供 go_file 和 launcher_binary 其中之一。
worker_binary (str) – 指向针对 worker 平台编译的 Apache Beam 管线 Go 二进制文件的引用，例如 /local/path/to/worker-main 或 gs://bucket/path/to/worker-main。如果运行管线的 worker 的操作系统或架构与启动管线的平台的操作系统或架构不同，则需要此参数。更多信息，请参阅 Apache Beam 关于 Go 交叉编译的文档：https://beam.apache.org/documentation/sdks/go-cross-compilation/。如果未设置 launcher_binary，则提供 worker_binary 将无效。如果设置了 launcher_binary 但未设置 worker_binary，则 worker_binary 将默认为 launcher_binary 的值。

template_fields = ['go_file', 'launcher_binary', 'worker_binary', 'runner', 'pipeline_options',...[source]¶

template_fields_renderers[source]¶

operator_extra_links[source]¶

go_file = ''[source]¶

launcher_binary = ''[source]¶

worker_binary = ''[source]¶

execute(context)[source]¶

执行 Apache Beam 管线。

on_kill()[source]¶

覆盖此方法以在任务实例被终止时清理子进程。

在操作符内部使用 threading、subprocess 或 multiprocessing 模块时，需要进行清理，否则会留下僵尸进程。