Airflow Interview Questions

TL;DR

30+ Airflow interview questions organized by topic. Architecture and DAG questions come up in every data engineering interview.

Short on time? Focus on Architecture, DAGs & Tasks, and Scenario Questions.

Architecture

Q: What are the main components of Airflow?

Scheduler: triggers DAG runs and tasks. Executor: determines where tasks run. Workers: execute tasks. Web Server: serves the UI. Metadata Database: stores state (PostgreSQL/MySQL). DAG Directory: folder containing DAG Python files.

Q: Compare LocalExecutor, CeleryExecutor, and KubernetesExecutor.

Local: parallel tasks on one machine, no external deps. Celery: distributed to persistent workers via Redis/RabbitMQ, fast startup, fixed pool. Kubernetes: new pod per task, perfect isolation, dynamic scaling, slower startup. Local for small; Celery for large steady workloads; K8s for diverse/bursty workloads.

Q: What is the metadata database and what does it store?

A relational database (PostgreSQL or MySQL) that stores: serialized DAG definitions and DAG run state, task instances with their status and history, XCom values, Variables, Connections, user accounts, and audit logs. It's the source of truth for the entire Airflow installation.

Q: How does Airflow discover DAGs?

The Scheduler periodically scans the dags_folder (configured in airflow.cfg). Any Python file containing a top-level DAG object or @dag-decorated function is registered. The parse interval is controlled by dag_dir_list_interval (default 300s) and min_file_process_interval.

Q: How do you scale Airflow for high-throughput pipelines?

(1) Switch to CeleryExecutor or KubernetesExecutor. (2) Add more workers. (3) Increase parallelism (global) and max_active_tasks_per_dag. (4) Use Pools to limit resource contention. (5) Run multiple schedulers (Airflow 2.0+). (6) Optimize DAG parsing (reduce top-level code, use .airflowignore). (7) Use a performant metadata DB (PostgreSQL with connection pooling).
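As an illustrative airflow.cfg fragment tying these knobs together (example values; exact option names and sections can vary by Airflow version):

```ini
[core]
executor = CeleryExecutor
parallelism = 64                  # max running task instances across the installation
max_active_tasks_per_dag = 16

[scheduler]
dag_dir_list_interval = 300       # how often the dags_folder is re-listed
min_file_process_interval = 30    # minimum seconds between re-parses of one file
```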

DAGs & Tasks

Q: What is execution_date / logical_date?

The start of the data interval being processed, NOT when the DAG runs. A @daily DAG with logical_date 2024-01-15 processes Jan 15 data but actually runs on Jan 16. Airflow 2.2+ renamed it to logical_date with data_interval_start/data_interval_end for clarity.
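The interval arithmetic can be sketched in plain Python (a simplified model of what the @daily timetable computes; daily_interval is not an Airflow function):

```python
from datetime import datetime, timedelta

def daily_interval(logical_date: datetime) -> tuple[datetime, datetime, datetime]:
    """For a @daily schedule: the data interval a run covers, and the
    earliest moment that run can actually fire."""
    data_interval_start = logical_date
    data_interval_end = logical_date + timedelta(days=1)
    run_after = data_interval_end  # a run fires only once its interval has closed
    return data_interval_start, data_interval_end, run_after

start, end, run_after = daily_interval(datetime(2024, 1, 15))
# the Jan 15 run covers [Jan 15, Jan 16) and fires on Jan 16
```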

Q: What is the difference between catchup=True and catchup=False?

catchup=True (default): when a DAG is unpaused, Airflow creates DAG runs for all missed intervals since start_date. catchup=False: only creates a run for the latest interval. Use False for pipelines that don't need historical backfill (e.g., alerts). Use True for data pipelines that must process every day.
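The difference can be sketched in plain Python (runs_to_create is an illustrative model, not Airflow's actual scheduler logic):

```python
from datetime import datetime, timedelta

def runs_to_create(start_date: datetime, now: datetime, catchup: bool) -> list:
    """Logical dates a @daily DAG would get when unpaused at `now`."""
    day = timedelta(days=1)
    missed = []
    cursor = start_date
    while cursor + day <= now:     # every fully-closed daily interval
        missed.append(cursor)
        cursor += day
    if catchup:
        return missed              # one run per missed interval
    return missed[-1:]             # only the most recent interval

missed_all = runs_to_create(datetime(2024, 1, 1), datetime(2024, 1, 10), catchup=True)
latest_only = runs_to_create(datetime(2024, 1, 1), datetime(2024, 1, 10), catchup=False)
# missed_all has 9 runs (Jan 1 .. Jan 9); latest_only is just [Jan 9]
```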

Q: What is the difference between an Operator, a Task, and a Task Instance?

Operator: class defining work type (PythonOperator, BashOperator). Task: an operator instance with specific args, a node in the DAG. Task Instance: a specific run of a task for a specific execution_date. Task is the definition; Task Instance is one execution of it.

Q: What are Datasets in Airflow? How do they improve inter-DAG dependencies?

Datasets (Airflow 2.4+) are URIs that represent data produced by tasks. A DAG can declare it produces a dataset; another DAG can declare it's triggered when that dataset is updated. Better than ExternalTaskSensor because: (1) no polling, (2) works across DAGs naturally, (3) visible in the UI's Dataset tab. Example: Dataset("s3://bucket/orders").

Q: How do trigger rules work? When would you use all_done?

Trigger rules determine when a task is eligible to run based on upstream states. Default is all_success. all_done runs after all upstreams finish regardless of state — use for cleanup tasks (deleting temp files, releasing resources) that must always execute even if the pipeline fails.

Q: What is Dynamic Task Mapping?

Airflow 2.3+ feature that creates tasks at runtime from a list. Instead of generating tasks at parse time, you use .expand(): process.expand(file=get_files()). Airflow creates N parallel mapped task instances based on the runtime output. Better than loops because the task count is dynamic.

Operations & Troubleshooting

Q: A task keeps failing. Walk through your debugging process.

(1) Check task logs in the UI (most issues are here). (2) Check execution_date — is it processing the right interval? (3) Test locally: airflow tasks test dag_id task_id date. (4) Check Connections — are credentials valid? (5) Check resources — is the worker out of memory/disk? (6) Check upstream data — did the source change schema? (7) Check Airflow scheduler logs for parsing errors.

Q: How do you handle secrets/credentials in Airflow?

(1) Connections for database/API credentials (stored encrypted in metadata DB). (2) Variables for config values. (3) Secrets Backend for production: AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager. The backend is checked first, falling back to the DB. Never hardcode secrets in DAG files.
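With the environment-variables secrets backend, a Connection can also be supplied as a URI in an AIRFLOW_CONN_<CONN_ID> variable, resolved before the metadata DB (placeholder credentials shown):

```shell
# Connection "warehouse" supplied via environment instead of the DB
export AIRFLOW_CONN_WAREHOUSE='postgresql://user:pass@host:5432/analytics'
```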

Q: What are Pools and what problem do they solve?

Pools limit the number of concurrent task instances for a resource. Example: a database can handle 5 concurrent connections — create a pool with 5 slots and assign DB tasks to it. Without pools, 50 tasks could hit the DB simultaneously and overwhelm it.

Q: How do you backfill a DAG for a past date range?

airflow dags backfill -s 2024-01-01 -e 2024-01-31 my_dag. This creates DAG runs for each interval in the range. Requirements: tasks must be idempotent (re-running produces correct results). Set max_active_runs to control concurrency during backfill.

Scenario Questions

Q: Design a DAG that: extracts from 3 APIs in parallel, transforms each, then loads all into a warehouse, then sends an email.

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def multi_source_etl():
    @task
    def extract(api: str): ...
    @task
    def transform(data): ...
    @task
    def load(all_data): ...
    @task
    def notify(): ...

    results = []
    for api in ["users", "orders", "products"]:
        raw = extract(api)
        clean = transform(raw)
        results.append(clean)
    load(results) >> notify()

multi_source_etl()

The 3 extract-transform chains run in parallel. load() waits for all 3. notify() runs after load.

Q: How would you set up Airflow to trigger a Power BI dataset refresh after a pipeline completes?

Use a SimpleHttpOperator or PythonOperator to call the Power BI REST API. Steps: (1) Register an Azure AD app with Power BI permissions. (2) Store credentials in an Airflow Connection. (3) Get an OAuth token. (4) POST to datasets/{id}/refreshes. (5) Optionally add a sensor to poll the refresh status until complete.
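Steps 4-5 could be wrapped in a PythonOperator callable along these lines (a sketch using the documented Power BI refresh endpoint; token acquisition and status polling are left out, and `refresh_url`/`trigger_refresh` are illustrative names):

```python
from typing import Optional
import requests

API = "https://api.powerbi.com/v1.0/myorg"

def refresh_url(dataset_id: str, group_id: Optional[str] = None) -> str:
    """Refresh endpoint; workspace-scoped when group_id is given."""
    if group_id:
        return f"{API}/groups/{group_id}/datasets/{dataset_id}/refreshes"
    return f"{API}/datasets/{dataset_id}/refreshes"

def trigger_refresh(dataset_id: str, token: str, group_id: Optional[str] = None) -> None:
    # `token` is the OAuth bearer token from the Azure AD app (step 3)
    resp = requests.post(
        refresh_url(dataset_id, group_id),
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()  # Power BI answers 202 Accepted when the refresh is queued
```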

Q: Your DAG processes data for 50 countries. Some fail occasionally. How do you ensure failed countries are retried without reprocessing successful ones?

Use Dynamic Task Mapping: process.expand(country=get_countries()). Each country becomes its own task instance. Failed ones can be retried independently via the UI ("Clear" just the failed instance). Set retries=2 with retry_delay for automatic retries. The successful countries' results are unaffected.

Q: How do you monitor Airflow in production?

(1) Alerting: on_failure_callback on tasks/DAGs to send Slack/email/PagerDuty alerts. (2) SLAs: sla parameter on tasks to alert when execution exceeds expected duration. (3) Metrics: Airflow exports StatsD/Prometheus metrics (scheduler heartbeat, task duration, pool usage). (4) Health check: /health endpoint. (5) Log aggregation: ship logs to ELK/CloudWatch/Datadog.
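An on_failure_callback is just a function that receives the task-instance context; a minimal Slack-webhook sketch (the webhook URL is a placeholder):

```python
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"  # placeholder

def failure_message(context: dict) -> str:
    ti = context["task_instance"]
    return f"Task {ti.task_id} in DAG {ti.dag_id} failed: {context.get('exception')}"

def alert_on_failure(context: dict) -> None:
    """Airflow calls this with the task-instance context when a task fails."""
    body = json.dumps({"text": failure_message(context)}).encode()
    req = urllib.request.Request(SLACK_WEBHOOK, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

# attach to every task in a DAG via default_args
default_args = {"on_failure_callback": alert_on_failure, "retries": 1}
```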