dbt-airflow
A Python package used to render dbt projects via Airflow DAGs. Every dbt resource type (models, seeds, snapshots and tests) is assigned an individual task, and the dependencies between them are inferred automatically. Additional tasks can also be added either before or after the entire dbt project, or in between specific dbt tasks.
Installation
The package is available on PyPI:
pip install dbt-airflow
Usage
First, make sure that you have a fresh and up-to-date manifest.json metadata file generated by the dbt CLI (hint: see here for the list of commands that will generate one).
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.dummy import DummyOperator

from dbt_airflow.core.config import DbtAirflowConfig, DbtProjectConfig, DbtProfileConfig
from dbt_airflow.core.task_group import DbtTaskGroup
from dbt_airflow.operators.execution import ExecutionOperator


with DAG(
    dag_id='test_dag',
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=['example'],
) as dag:

    # Placeholder tasks that run before and after the whole dbt project
    t1 = DummyOperator(task_id='dummy_1')
    t2 = DummyOperator(task_id='dummy_2')

    # Task group holding one Airflow task per dbt resource in the manifest
    tg = DbtTaskGroup(
        group_id='dbt-company',
        dbt_project_config=DbtProjectConfig(
            project_path=Path('/opt/airflow/example_dbt_project/'),
            manifest_path=Path('/opt/airflow/example_dbt_project/target/manifest.json'),
        ),
        dbt_profile_config=DbtProfileConfig(
            profiles_path=Path('/opt/airflow/example_dbt_project/profiles'),
            target='dev',
        ),
        dbt_airflow_config=DbtAirflowConfig(
            execution_operator=ExecutionOperator.BASH,
        ),
    )

    t1 >> tg >> t2
Things to know
Here's a list of some key aspects and assumptions of the implementation:
- Every dbt project, when compiled, will generate a metadata file under <dbt-project-dir>/target/manifest.json
- The manifest file contains information about the interdependencies of the project's data models
- dbt-airflow aims to extract these dependencies such that every dbt entity (snapshot, model, test and seed) has its own task in an Airflow DAG while the dependencies between entities are preserved
- Snapshots are never an upstream dependency of any task
- Creating snapshots on seeds does not make sense and is therefore not handled (it is unclear whether this is even possible on the dbt side)
- Models may have tests
- Snapshots may have tests
- Seeds may have tests
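
To make the dependency extraction described above more concrete, here is a minimal sketch of how upstream relationships could be read from a manifest. It is not the package's actual implementation. It only relies on the `nodes`, `resource_type` and `depends_on.nodes` keys found in a dbt manifest; `SAMPLE_MANIFEST`, `upstream_map` and `downstream_map` are hypothetical names made up for this illustration, and the sample is a heavily trimmed stand-in for a real manifest.json.

```python
from collections import defaultdict

# Hypothetical, heavily trimmed manifest. A real manifest.json has many more keys.
SAMPLE_MANIFEST = {
    "nodes": {
        "seed.shop.raw_orders": {
            "resource_type": "seed",
            "depends_on": {"nodes": []},
        },
        "model.shop.orders": {
            "resource_type": "model",
            "depends_on": {"nodes": ["seed.shop.raw_orders"]},
        },
        "test.shop.not_null_orders_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.shop.orders"]},
        },
        "snapshot.shop.orders_snapshot": {
            "resource_type": "snapshot",
            "depends_on": {"nodes": ["model.shop.orders"]},
        },
    }
}

# Resource types that get their own Airflow task, per the list above.
SUPPORTED_TYPES = {"model", "seed", "snapshot", "test"}


def upstream_map(manifest):
    """Map each supported node id to the sorted ids it depends on."""
    nodes = {
        unique_id: node
        for unique_id, node in manifest["nodes"].items()
        if node["resource_type"] in SUPPORTED_TYPES
    }
    return {
        unique_id: sorted(
            dep
            for dep in node.get("depends_on", {}).get("nodes", [])
            # Skip dependencies that are not tasks themselves (e.g. sources).
            if dep in nodes
        )
        for unique_id, node in nodes.items()
    }


def downstream_map(upstream):
    """Invert the upstream map: node id -> sorted ids that depend on it."""
    downstream = defaultdict(list)
    for unique_id, deps in upstream.items():
        downstream.setdefault(unique_id, [])
        for dep in deps:
            downstream[dep].append(unique_id)
    return {unique_id: sorted(children) for unique_id, children in downstream.items()}


upstream = upstream_map(SAMPLE_MANIFEST)
downstream = downstream_map(upstream)

for unique_id, deps in upstream.items():
    print(unique_id, "<-", deps)
```

In this sample the snapshot has no downstream entries, which matches the assumption that snapshots are never an upstream dependency of any task. Each key of `upstream` would correspond to one Airflow task, and each listed dependency to one edge between tasks.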