Core Concepts of Airflow

TL;DR

Airflow has 6 key pieces: DAGs (workflow definition), Tasks (units of work), Operators (task templates), Scheduler (triggers runs), Executor (runs tasks), and XComs (pass data between tasks). The Scheduler reads your DAG, creates Task Instances, and the Executor runs them.

The Big Picture

The pieces fit together like this: DAGs define workflows, Tasks are the units of work, Operators are the task types, the Scheduler triggers runs, the Executor runs tasks, and XComs pass data between them.
Explain Like I'm 12

DAG = the recipe (steps + order). Task = one step in the recipe ("chop onions"). Operator = the type of step (chopping, boiling, baking). Scheduler = the kitchen timer that says "start cooking at 6pm." Executor = the stove/oven where cooking actually happens. XCom = passing a bowl of chopped onions from one step to the next.

Cheat Sheet

| Concept | What It Does | Analogy |
| --- | --- | --- |
| DAG | Defines workflow: tasks + dependencies + schedule | Recipe with steps in order |
| Task | A single unit of work in a DAG | One step: "chop onions" |
| Operator | Template for a type of work | Type of kitchen action |
| Scheduler | Monitors DAGs, triggers runs at the right time | Kitchen timer |
| Executor | Determines where/how tasks run | The stove/oven |
| XCom | Passes small data between tasks | Passing a bowl between steps |

The 6 Building Blocks

DAGs — Your Workflow Definition

A DAG (Directed Acyclic Graph) is a Python file that defines what tasks to run, in what order, and on what schedule. "Directed" means tasks flow in one direction. "Acyclic" means no loops.

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG(
    dag_id="my_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",        # Run once per day
    catchup=False,             # Don't backfill past dates
    tags=["etl", "production"]
) as dag:

    # extract_data, transform_data, load_data: plain Python functions defined elsewhere
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    transform = PythonOperator(task_id="transform", python_callable=transform_data)
    load = PythonOperator(task_id="load", python_callable=load_data)

    extract >> transform >> load   # Set dependencies

Key DAG parameters: start_date (when scheduling begins), schedule (cron expression or a preset like @daily), catchup (whether to backfill missed runs), max_active_runs (concurrency limit), and default_args (shared task settings such as retries and owner).

Tasks & Operators

A Task is a node in your DAG. An Operator is the class that defines what that task does.

| Operator | What It Does | Example |
| --- | --- | --- |
| PythonOperator | Runs a Python function | Transform data with pandas |
| BashOperator | Runs a bash command | dbt run --models staging |
| SQLExecuteQueryOperator | Executes SQL on a database | INSERT INTO ... SELECT ... |
| S3ToRedshiftOperator | Loads S3 data into Redshift | Cloud data warehouse load |
| KubernetesPodOperator | Runs a container on K8s | Spark job in a pod |
| EmailOperator | Sends an email | Alert on pipeline completion |

Task Dependencies

# Bitshift operators (most common)
extract >> transform >> load

# Set upstream/downstream explicitly
transform.set_upstream(extract)    # same as extract >> transform
transform.set_downstream(load)     # same as transform >> load

# Fan-out / Fan-in
extract >> [transform_a, transform_b] >> load

Task Lifecycle

1. none (not yet queued)
2. scheduled (dependencies met, waiting for a slot)
3. queued (sent to the executor)
4. running (executing on a worker)
5. Terminal outcome: success, failed, or up_for_retry (re-queued after retry_delay)
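The lifecycle above can be sketched as a toy state machine. This is a simplified model for intuition only, not Airflow's actual scheduler code (which has additional states like deferred and upstream_failed):

```python
# Toy model of the task-instance lifecycle (simplified)
TRANSITIONS = {
    "none": ["scheduled"],
    "scheduled": ["queued"],
    "queued": ["running"],
    "running": ["success", "failed", "up_for_retry"],
    "up_for_retry": ["scheduled"],   # a retry re-enters the queueing cycle
}

def is_valid_transition(current: str, nxt: str) -> bool:
    return nxt in TRANSITIONS.get(current, [])

# A successful run walks straight through the pipeline
path = ["none", "scheduled", "queued", "running", "success"]
assert all(is_valid_transition(a, b) for a, b in zip(path, path[1:]))

# A retry loops back to scheduled instead of terminating
assert is_valid_transition("running", "up_for_retry")
assert is_valid_transition("up_for_retry", "scheduled")
```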

Scheduler — The Brain

The Scheduler is a long-running process that:

  1. Parses DAG files from the dags/ folder
  2. Determines which DAG runs need to be created based on schedule
  3. Checks task dependencies for each DAG run
  4. Sends ready tasks to the Executor

execution_date confusion: In Airflow, execution_date (now logical_date) is the start of the data interval, not when the DAG runs. A daily DAG for 2024-01-15 actually runs on 2024-01-16 (at the end of the interval). This trips up everyone at first.
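The interval arithmetic can be checked with plain datetime math: for a daily schedule, the run stamped with logical_date Jan 15 only starts once that interval has closed.

```python
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)      # "@daily"
logical_date = datetime(2024, 1, 15)       # start of the data interval

data_interval_start = logical_date
data_interval_end = logical_date + schedule_interval

# The scheduler triggers the run only once the interval has closed,
# i.e. at (or just after) data_interval_end.
actual_run_time = data_interval_end

print(data_interval_start)  # 2024-01-15 00:00:00
print(actual_run_time)      # 2024-01-16 00:00:00
```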

Executors — Where Tasks Run

| Executor | How It Works | Best For |
| --- | --- | --- |
| SequentialExecutor | One task at a time (default) | Development/testing only |
| LocalExecutor | Parallel tasks on the same machine | Small-medium workloads |
| CeleryExecutor | Distributes to Celery workers via message queue | Large-scale, multi-node |
| KubernetesExecutor | Spins up a new K8s pod per task | Dynamic scaling, isolation |

Production rule: Never use SequentialExecutor in production. Start with LocalExecutor. Move to CeleryExecutor or KubernetesExecutor when you need to scale beyond one machine.
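The executor is selected in airflow.cfg (or via the AIRFLOW__CORE__EXECUTOR environment variable). A minimal sketch, with illustrative hostnames:

```ini
# airflow.cfg -- pick one executor per deployment
[core]
executor = LocalExecutor

# CeleryExecutor additionally needs a broker and result backend, e.g.:
# [celery]
# broker_url = redis://redis:6379/0
# result_backend = db+postgresql://airflow:airflow@postgres/airflow
```

Note that SequentialExecutor and LocalExecutor need no extra infrastructure, while Celery requires a message broker and Kubernetes requires cluster access.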

XComs — Pass Data Between Tasks

XCom (Cross-Communication) lets tasks share small pieces of data. A task pushes a value, and downstream tasks pull it.

# Push (automatic with return value in TaskFlow API)
@task
def extract():
    data = fetch_from_api()
    return data  # automatically pushed as XCom

# Pull
@task
def transform(data):  # automatically pulled
    return clean(data)

XCom is for metadata, not data. XCom stores values in the Airflow metadata database. Don't push large DataFrames or files; push a file path or S3 key instead. The practical size limit is small (commonly cited as about 48 KB, though the real ceiling depends on the metadata database backend), and custom XCom backends can offload larger objects to external storage.
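The pattern is "push a pointer, not the payload." Outside Airflow it looks like this, with plain functions standing in for tasks; in a real DAG the path would typically be an S3 key rather than a local temp file:

```python
import json
import tempfile
from pathlib import Path

def extract() -> str:
    """Write the 'large' dataset to storage and return only its path."""
    rows = [{"id": i, "value": i * 10} for i in range(1000)]
    path = Path(tempfile.mkdtemp()) / "extracted.json"
    path.write_text(json.dumps(rows))
    return str(path)          # a small string: safe to pass as an XCom

def transform(path: str) -> int:
    """Downstream task loads the data from the path it was handed."""
    rows = json.loads(Path(path).read_text())
    return sum(r["value"] for r in rows)

path = extract()              # only the path travels between "tasks"
total = transform(path)
print(total)                  # 4995000
```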

Connections & Variables

Connections store credentials for external systems (database host/port/password, AWS keys, API tokens). Managed in the UI or via environment variables.

Variables are global key-value pairs for configuration (environment name, feature flags, thresholds). Access with Variable.get("key").

Never hardcode secrets in DAG files. Use Connections for credentials and Variables for configuration. Better yet, use a secrets backend (AWS Secrets Manager, HashiCorp Vault).
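One common pattern is defining a Connection as an environment variable named AIRFLOW_CONN_&lt;CONN_ID&gt; in URI form, which keeps credentials out of DAG files entirely. A sketch of the URI format (host and credentials here are fake placeholders), parsed with the standard library just to show its parts:

```python
import os
from urllib.parse import urlsplit

# Airflow reads connections from env vars named AIRFLOW_CONN_<CONN_ID>.
# The value is a URI: scheme://login:password@host:port/schema
os.environ["AIRFLOW_CONN_MY_POSTGRES"] = (
    "postgresql://etl_user:fake-password@db.example.com:5432/analytics"
)

uri = urlsplit(os.environ["AIRFLOW_CONN_MY_POSTGRES"])
print(uri.hostname)   # db.example.com
print(uri.port)       # 5432
print(uri.path)       # /analytics
```

Inside a task you would never parse this yourself; a Hook (e.g. PostgresHook) looks up the connection by its ID and builds the client for you.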

Test Yourself

Q: What is the difference between a Task and an Operator?

An Operator is a template/class (e.g., PythonOperator). A Task is an instance of that operator with specific parameters, assigned to a DAG. Think: Operator is the cookie cutter, Task is the cookie.

Q: What does "Directed Acyclic" mean in DAG?

Directed: tasks flow in one direction (upstream → downstream). Acyclic: no cycles — you can't have task A depend on task B which depends on task A. This guarantees the workflow has a clear start and end.

Q: When should you NOT use XCom?

For large data (DataFrames, files, images). XCom stores in the metadata database with a small size limit. Instead, write data to S3/GCS/a database and pass the path/key via XCom.

Interview Questions

Q: Explain execution_date. Why does a daily DAG for Jan 15 run on Jan 16?

execution_date (now logical_date) is the start of the data interval, not the run time. A daily DAG processes data FOR a given day. It runs at the END of that interval (start of the next). So data for Jan 15 (00:00-23:59) is processed when the interval closes on Jan 16 at 00:00.

Q: What happens when a task fails? How does retry work?

On failure: task state becomes failed. If retries > 0, it moves to up_for_retry, waits retry_delay, then re-queues. After exhausting retries, it stays failed. Downstream tasks are skipped (unless trigger_rule is set differently). You can configure on_failure_callback for alerts (Slack, email).
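The retry loop can be sketched in plain Python. This is a toy model of the retries/retry_delay behavior for intuition, not Airflow's internals (the real retry_delay wait is handled by the scheduler, not a loop):

```python
def run_with_retries(task, retries: int = 2):
    """Toy retry loop: mimics failed -> up_for_retry -> re-queue."""
    states = []
    for attempt in range(retries + 1):
        try:
            task()
            states.append("success")
            return states
        except Exception:
            if attempt < retries:
                states.append("up_for_retry")  # waits retry_delay, then re-queues
            else:
                states.append("failed")        # retries exhausted: stays failed
    return states

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:        # fails twice, succeeds on the third attempt
        raise RuntimeError("transient error")

print(run_with_retries(flaky, retries=2))
# ['up_for_retry', 'up_for_retry', 'success']
```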

Q: Compare CeleryExecutor vs KubernetesExecutor.

CeleryExecutor: tasks dispatched to a pool of persistent workers via message queue (Redis/RabbitMQ). Fast startup, fixed resource pool. KubernetesExecutor: spins up a new pod per task, destroys it after. Strong isolation, dynamic scaling, but slower startup. Choose Celery for many small fast tasks; K8s for diverse resource needs and isolation.

Q: How do you handle dependencies between DAGs (not just tasks)?

Options: (1) TriggerDagRunOperator — one DAG triggers another. (2) ExternalTaskSensor — wait for a task in another DAG to complete. (3) Datasets (Airflow 2.4+) — DAG A produces a dataset, DAG B is triggered when the dataset is updated. Datasets are the modern, preferred approach.