Core Concepts of Airflow

TL;DR

Airflow has 6 key pieces: DAGs (workflow definition), Tasks (units of work), Operators (task templates), Scheduler (triggers runs), Executor (runs tasks), and XComs (pass data between tasks). The Scheduler reads your DAG, creates Task Instances, and the Executor runs them.

The Big Picture

The pieces fit together like this: DAGs define workflows, Tasks are the units of work, Operators are the task types, the Scheduler triggers runs, the Executor runs tasks, and XComs pass data between them.
Explain Like I'm 12

DAG = the recipe (steps + order). Task = one step in the recipe ("chop onions"). Operator = the type of step (chopping, boiling, baking). Scheduler = the kitchen timer that says "start cooking at 6pm." Executor = the stove/oven where cooking actually happens. XCom = passing a bowl of chopped onions from one step to the next.

Cheat Sheet

| Concept | What It Does | Analogy |
| --- | --- | --- |
| DAG | Defines workflow: tasks + dependencies + schedule | Recipe with steps in order |
| Task | A single unit of work in a DAG | One step: "chop onions" |
| Operator | Template for a type of work | Type of kitchen action |
| Scheduler | Monitors DAGs, triggers runs at the right time | Kitchen timer |
| Executor | Determines where/how tasks run | The stove/oven |
| XCom | Passes small data between tasks | Passing a bowl between steps |

The 6 Building Blocks

DAGs — Your Workflow Definition

A DAG (Directed Acyclic Graph) is a Python file that defines what tasks to run, in what order, and on what schedule. "Directed" means tasks flow in one direction. "Acyclic" means no loops.

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG(
    dag_id="my_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",        # Run once per day
    catchup=False,             # Don't backfill past dates
    tags=["etl", "production"]
) as dag:

    # extract_data, transform_data, load_data: plain Python functions defined elsewhere
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    transform = PythonOperator(task_id="transform", python_callable=transform_data)
    load = PythonOperator(task_id="load", python_callable=load_data)

    extract >> transform >> load   # Set dependencies

Key DAG parameters: start_date (when scheduling begins), schedule (cron expression or a preset like @daily), catchup (whether to backfill missed runs), max_active_runs (concurrency limit), and default_args (shared task settings such as retries and owner).

Tasks & Operators

A Task is a node in your DAG. An Operator is the class that defines what that task does.

| Operator | What It Does | Example |
| --- | --- | --- |
| PythonOperator | Runs a Python function | Transform data with pandas |
| BashOperator | Runs a bash command | dbt run --models staging |
| SQLExecuteQueryOperator | Executes SQL on a database | INSERT INTO ... SELECT ... |
| S3ToRedshiftOperator | Loads S3 data into Redshift | Cloud data warehouse load |
| KubernetesPodOperator | Runs a container on K8s | Spark job in a pod |
| EmailOperator | Sends an email | Alert on pipeline completion |

Task Dependencies

# Bitshift operators (most common)
extract >> transform >> load

# Set upstream/downstream explicitly
transform.set_upstream(extract)    # same as extract >> transform
transform.set_downstream(load)     # same as transform >> load

# Fan-out / Fan-in
extract >> [transform_a, transform_b] >> load

Task Lifecycle

1. none (not yet queued)
2. scheduled (dependencies met, waiting for a slot)
3. queued (sent to the executor)
4. running (executing on a worker)
5. Terminal outcome: success, failed, or up_for_retry (re-queued after retry_delay)
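The lifecycle above can be sketched as a toy state machine. This is a simplified model for intuition only, not Airflow's actual scheduler code (which has additional states like deferred and upstream_failed):

```python
# Toy model of the task-instance lifecycle (simplified)
TRANSITIONS = {
    "none": ["scheduled"],
    "scheduled": ["queued"],
    "queued": ["running"],
    "running": ["success", "failed", "up_for_retry"],
    "up_for_retry": ["scheduled"],   # a retry re-enters the queueing cycle
}

def is_valid_transition(current: str, nxt: str) -> bool:
    return nxt in TRANSITIONS.get(current, [])

# A successful run walks straight through the pipeline
path = ["none", "scheduled", "queued", "running", "success"]
assert all(is_valid_transition(a, b) for a, b in zip(path, path[1:]))

# A retry loops back to scheduled instead of terminating
assert is_valid_transition("running", "up_for_retry")
assert is_valid_transition("up_for_retry", "scheduled")
```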

Scheduler — The Brain

The Scheduler is a long-running process that:

  1. Parses DAG files from the dags/ folder
  2. Determines which DAG runs need to be created based on schedule
  3. Checks task dependencies for each DAG run
  4. Sends ready tasks to the Executor

execution_date confusion: In Airflow, execution_date (now logical_date) is the start of the data interval, not when the DAG runs. A daily DAG for 2024-01-15 actually runs on 2024-01-16 (at the end of the interval). This trips up everyone at first.
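The interval arithmetic can be checked with plain datetime math: for a daily schedule, the run stamped with logical_date Jan 15 only starts once that interval has closed.

```python
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)      # "@daily"
logical_date = datetime(2024, 1, 15)       # start of the data interval

data_interval_start = logical_date
data_interval_end = logical_date + schedule_interval

# The scheduler triggers the run only once the interval has closed,
# i.e. at (or just after) data_interval_end.
actual_run_time = data_interval_end

print(data_interval_start)  # 2024-01-15 00:00:00
print(actual_run_time)      # 2024-01-16 00:00:00
```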

Executors — Where Tasks Run

| Executor | How It Works | Best For |
| --- | --- | --- |
| SequentialExecutor | One task at a time (default) | Development/testing only |
| LocalExecutor | Parallel tasks on the same machine | Small-medium workloads |
| CeleryExecutor | Distributes to Celery workers via message queue | Large-scale, multi-node |
| KubernetesExecutor | Spins up a new K8s pod per task | Dynamic scaling, isolation |

Production rule: Never use SequentialExecutor in production. Start with LocalExecutor. Move to CeleryExecutor or KubernetesExecutor when you need to scale beyond one machine.
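The executor is selected in airflow.cfg (or via the AIRFLOW__CORE__EXECUTOR environment variable). A minimal sketch, with illustrative hostnames:

```ini
# airflow.cfg -- pick one executor per deployment
[core]
executor = LocalExecutor

# CeleryExecutor additionally needs a broker and result backend, e.g.:
# [celery]
# broker_url = redis://redis:6379/0
# result_backend = db+postgresql://airflow:airflow@postgres/airflow
```

Note that SequentialExecutor and LocalExecutor need no extra infrastructure, while Celery requires a message broker and Kubernetes requires cluster access.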

XComs — Pass Data Between Tasks

XCom (Cross-Communication) lets tasks share small pieces of data. A task pushes a value, and downstream tasks pull it.

# Push (automatic with return value in TaskFlow API)
@task
def extract():
    data = fetch_from_api()
    return data  # automatically pushed as XCom

# Pull
@task
def transform(data):  # automatically pulled
    return clean(data)

XCom is for metadata, not data. XCom stores values in the Airflow metadata database. Don't push large DataFrames or files; push a file path or S3 key instead. The practical size limit is small (commonly cited as about 48 KB, though the real ceiling depends on the metadata database backend), and custom XCom backends can offload larger objects to external storage.
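The pattern is "push a pointer, not the payload." Outside Airflow it looks like this, with plain functions standing in for tasks; in a real DAG the path would typically be an S3 key rather than a local temp file:

```python
import json
import tempfile
from pathlib import Path

def extract() -> str:
    """Write the 'large' dataset to storage and return only its path."""
    rows = [{"id": i, "value": i * 10} for i in range(1000)]
    path = Path(tempfile.mkdtemp()) / "extracted.json"
    path.write_text(json.dumps(rows))
    return str(path)          # a small string: safe to pass as an XCom

def transform(path: str) -> int:
    """Downstream task loads the data from the path it was handed."""
    rows = json.loads(Path(path).read_text())
    return sum(r["value"] for r in rows)

path = extract()              # only the path travels between "tasks"
total = transform(path)
print(total)                  # 4995000
```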

Connections & Variables

Connections store credentials for external systems (database host/port/password, AWS keys, API tokens). Managed in the UI or via environment variables.

Variables are global key-value pairs for configuration (environment name, feature flags, thresholds). Access with Variable.get("key").

Never hardcode secrets in DAG files. Use Connections for credentials and Variables for configuration. Better yet, use a secrets backend (AWS Secrets Manager, HashiCorp Vault).
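One common pattern is defining a Connection as an environment variable named AIRFLOW_CONN_&lt;CONN_ID&gt; in URI form, which keeps credentials out of DAG files entirely. A sketch of the URI format (host and credentials here are fake placeholders), parsed with the standard library just to show its parts:

```python
import os
from urllib.parse import urlsplit

# Airflow reads connections from env vars named AIRFLOW_CONN_<CONN_ID>.
# The value is a URI: scheme://login:password@host:port/schema
os.environ["AIRFLOW_CONN_MY_POSTGRES"] = (
    "postgresql://etl_user:fake-password@db.example.com:5432/analytics"
)

uri = urlsplit(os.environ["AIRFLOW_CONN_MY_POSTGRES"])
print(uri.hostname)   # db.example.com
print(uri.port)       # 5432
print(uri.path)       # /analytics
```

Inside a task you would never parse this yourself; a Hook (e.g. PostgresHook) looks up the connection by its ID and builds the client for you.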

Test Yourself

Q: What is the difference between a Task and an Operator?

An Operator is a template/class (e.g., PythonOperator). A Task is an instance of that operator with specific parameters, assigned to a DAG. Think: Operator is the cookie cutter, Task is the cookie.

Q: What does "Directed Acyclic" mean in DAG?

Directed: tasks flow in one direction (upstream → downstream). Acyclic: no cycles — you can't have task A depend on task B which depends on task A. This guarantees the workflow has a clear start and end.

Q: When should you NOT use XCom?

For large data (DataFrames, files, images). XCom stores in the metadata database with a small size limit. Instead, write data to S3/GCS/a database and pass the path/key via XCom.

Interview Questions

Q: Explain execution_date. Why does a daily DAG for Jan 15 run on Jan 16?

execution_date (now logical_date) is the start of the data interval, not the run time. A daily DAG processes data FOR a given day. It runs at the END of that interval (start of the next). So data for Jan 15 (00:00-23:59) is processed when the interval closes on Jan 16 at 00:00.

Q: What happens when a task fails? How does retry work?

On failure: task state becomes failed. If retries > 0, it moves to up_for_retry, waits retry_delay, then re-queues. After exhausting retries, it stays failed. Downstream tasks are skipped (unless trigger_rule is set differently). You can configure on_failure_callback for alerts (Slack, email).
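The retry loop can be sketched in plain Python. This is a toy model of the retries/retry_delay behavior for intuition, not Airflow's internals (the real retry_delay wait is handled by the scheduler, not a loop):

```python
def run_with_retries(task, retries: int = 2):
    """Toy retry loop: mimics failed -> up_for_retry -> re-queue."""
    states = []
    for attempt in range(retries + 1):
        try:
            task()
            states.append("success")
            return states
        except Exception:
            if attempt < retries:
                states.append("up_for_retry")  # waits retry_delay, then re-queues
            else:
                states.append("failed")        # retries exhausted: stays failed
    return states

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:        # fails twice, succeeds on the third attempt
        raise RuntimeError("transient error")

print(run_with_retries(flaky, retries=2))
# ['up_for_retry', 'up_for_retry', 'success']
```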

Q: Compare CeleryExecutor vs KubernetesExecutor.

CeleryExecutor: tasks dispatched to a pool of persistent workers via message queue (Redis/RabbitMQ). Fast startup, fixed resource pool. KubernetesExecutor: spins up a new pod per task, destroys it after. Strong isolation, dynamic scaling, but slower startup. Choose Celery for many small fast tasks; K8s for diverse resource needs and isolation.

Q: How do you handle dependencies between DAGs (not just tasks)?

Options: (1) TriggerDagRunOperator — one DAG triggers another. (2) ExternalTaskSensor — wait for a task in another DAG to complete. (3) Datasets (Airflow 2.4+) — DAG A produces a dataset, DAG B is triggered when the dataset is updated. Datasets are the modern, preferred approach.