dbt-airflow

A Python package for rendering dbt projects as Airflow DAGs. Every dbt resource type, including models, seeds, snapshots and tests, is assigned an individual task, and the dependencies between them are inferred automatically. Additional tasks can also be added either before/after the entire dbt project, or in between specific dbt tasks.

Installation

The package is available on PyPI:

pip install dbt-airflow

Usage

First, make sure that you have a fresh, up-to-date manifest.json metadata file, as generated by the dbt CLI (hint: commands such as dbt compile, dbt run or dbt build will produce one).
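If you want to sanity-check the manifest before wiring up the DAG, you can inspect its metadata block, which records the dbt version and generation timestamp. A minimal sketch, where the sample dict and the path in the comment are purely illustrative:

```python
import json
from pathlib import Path


def manifest_summary(manifest: dict) -> str:
    """Return a one-line summary of a parsed dbt manifest.json."""
    meta = manifest.get("metadata", {})
    return f"dbt {meta.get('dbt_version', '?')} generated {meta.get('generated_at', '?')}"


# Trimmed-down manifest fragment for illustration (real files are much larger).
sample = {"metadata": {"dbt_version": "1.7.0", "generated_at": "2024-01-01T00:00:00Z"}}
print(manifest_summary(sample))

# For a real project, something like:
# manifest_summary(json.loads(Path("target/manifest.json").read_text()))
```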

from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.dummy import DummyOperator

from dbt_airflow.core.config import DbtAirflowConfig, DbtProjectConfig, DbtProfileConfig
from dbt_airflow.core.task_group import DbtTaskGroup
from dbt_airflow.operators.execution import ExecutionOperator


with DAG(
    dag_id='test_dag',
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=['example'],
) as dag:

    t1 = DummyOperator(task_id='dummy_1')
    t2 = DummyOperator(task_id='dummy_2')

    tg = DbtTaskGroup(
        group_id='dbt-company',
        dbt_project_config=DbtProjectConfig(
            project_path=Path('/opt/airflow/example_dbt_project/'),
            manifest_path=Path('/opt/airflow/example_dbt_project/target/manifest.json'),
        ),
        dbt_profile_config=DbtProfileConfig(
            profiles_path=Path('/opt/airflow/example_dbt_project/profiles'),
            target='dev',
        ),
        dbt_airflow_config=DbtAirflowConfig(
            execution_operator=ExecutionOperator.BASH,
        ),
    )

    t1 >> tg >> t2    

Things to know

Here's a list of some key aspects and assumptions of the implementation:

  • Every dbt project, when compiled, will generate a metadata file under <dbt-project-dir>/target/manifest.json
  • The manifest file contains information about the interdependencies of the project's data models
  • dbt-airflow aims to extract these dependencies such that every dbt entity (snapshot, model, test and seed) gets its own task in an Airflow DAG, while entity dependencies are preserved
  • Snapshots are never an upstream dependency of any task
  • Creating snapshots on seeds does not make sense and is therefore not handled (it is unclear whether dbt even supports this)
  • Models may have tests
  • Snapshots may have tests
  • Seeds may have tests
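To illustrate the dependency extraction described above: each entry under the manifest's "nodes" key lists its upstream nodes in depends_on.nodes, which is what a task-level dependency graph can be built from. A minimal sketch (the keys follow dbt's manifest schema; the node names are made up):

```python
def extract_dependencies(manifest: dict) -> dict[str, list[str]]:
    """Map each manifest node to the nodes it depends on."""
    deps = {}
    for node_id, node in manifest["nodes"].items():
        deps[node_id] = node.get("depends_on", {}).get("nodes", [])
    return deps


# Trimmed-down example: a seed feeding a model, with a test on the model.
sample_manifest = {
    "nodes": {
        "seed.company.raw_customers": {"depends_on": {"nodes": []}},
        "model.company.customers": {
            "depends_on": {"nodes": ["seed.company.raw_customers"]}
        },
        "test.company.unique_customers_id": {
            "depends_on": {"nodes": ["model.company.customers"]}
        },
    }
}

print(extract_dependencies(sample_manifest))
```

Each entry of the resulting mapping corresponds to one Airflow task and its upstream tasks, which is how the seed -> model -> test ordering is preserved in the generated task group.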