Airflow Interview Questions
30+ Airflow interview questions organized by topic. Focus on Architecture and DAGs — they come up in every data engineering interview.
Architecture
Q: What are the main components of Airflow?
Q: Compare LocalExecutor, CeleryExecutor, and KubernetesExecutor.
Q: What is the metadata database and what does it store?
Q: How does Airflow discover DAGs?
The scheduler's DAG processor scans the dags_folder (configured in airflow.cfg). Any Python file containing a top-level DAG object or @dag-decorated function is registered. The parse interval is controlled by dag_dir_list_interval (default 300s) and min_file_process_interval.

Q: How do you scale Airflow for high-throughput pipelines?
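A few of the relevant knobs as an airflow.cfg sketch (values are illustrative, not recommendations):

```ini
[core]
executor = CeleryExecutor
parallelism = 64                  # max task instances running across the whole installation
max_active_tasks_per_dag = 16     # per-DAG concurrency cap

[scheduler]
min_file_process_interval = 30    # seconds between re-parses of each DAG file
dag_dir_list_interval = 300       # seconds between dags_folder scans
```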
(1) Move from LocalExecutor to CeleryExecutor or KubernetesExecutor so work is distributed across workers. (2) Add workers (Celery) or let Kubernetes schedule task pods elastically. (3) Tune parallelism (global) and max_active_tasks_per_dag. (4) Use Pools to limit resource contention. (5) Run multiple schedulers (Airflow 2.0+). (6) Optimize DAG parsing (reduce top-level code, use .airflowignore). (7) Use a performant metadata DB (PostgreSQL with connection pooling).

DAGs & Tasks
Q: What is execution_date / logical_date?
The logical timestamp identifying the data interval a run covers, not the wall-clock time the run starts: a @daily DAG with logical_date 2024-01-15 processes Jan 15 data but actually runs on Jan 16. Airflow 2.2+ renamed execution_date to logical_date, with data_interval_start/data_interval_end for clarity.

Q: What is the difference between catchup=True and catchup=False?
catchup=True (default): when a DAG is unpaused, Airflow creates DAG runs for all missed intervals since start_date. catchup=False: only creates a run for the latest interval. Use False for pipelines that don't need historical backfill (e.g., alerts). Use True for data pipelines that must process every day.

Q: What is the difference between an Operator, a Task, and a Task Instance?
Q: What are Datasets in Airflow? How do they improve inter-DAG dependencies?
Datasets (Airflow 2.4+) let DAGs be scheduled by data updates instead of time: a producer task declares outlets=[Dataset("s3://bucket/orders")], and a consumer DAG sets schedule=[Dataset("s3://bucket/orders")] so it runs whenever the producer updates that dataset. This replaces brittle cross-DAG wiring like TriggerDagRunOperator or ExternalTaskSensor.

Q: How do trigger rules work? When would you use all_done?
Trigger rules control when a task runs based on the states of its upstream tasks; the default is all_success. all_done runs after all upstreams finish regardless of state — use for cleanup tasks (deleting temp files, releasing resources) that must always execute even if the pipeline fails.

Q: What is Dynamic Task Mapping?
An Airflow 2.3+ feature that creates task instances at runtime with .expand(): process.expand(file=get_files()). Airflow creates N parallel mapped task instances based on the runtime output. Better than writing loops in DAG code because the task count is dynamic.

Operations & Troubleshooting
Q: A task keeps failing. Walk through your debugging process.
(1) Read the task logs in the UI. (2) Check the execution_date — is it processing the right interval? (3) Test locally: airflow tasks test dag_id task_id date. (4) Check Connections — are credentials valid? (5) Check resources — is the worker out of memory/disk? (6) Check upstream data — did the source change schema? (7) Check Airflow scheduler logs for parsing errors.

Q: How do you handle secrets/credentials in Airflow?
Q: What are Pools and what problem do they solve?
Q: How do you backfill a DAG for a past date range?
airflow dags backfill -s 2024-01-01 -e 2024-01-31 my_dag. This creates DAG runs for each interval in the range. Requirements: tasks must be idempotent (re-running produces correct results). Set max_active_runs to control concurrency during backfill.

Scenario Questions
Q: Design a DAG that: extracts from 3 APIs in parallel, transforms each, then loads all into a warehouse, then sends an email.
```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def multi_source_etl():
    @task
    def extract(api: str): ...

    @task
    def transform(data): ...

    @task
    def load(all_data): ...

    @task
    def notify(): ...

    results = []
    for api in ["users", "orders", "products"]:
        raw = extract(api)   # Airflow auto-suffixes task_ids: extract, extract__1, ...
        clean = transform(raw)
        results.append(clean)
    load(results) >> notify()

multi_source_etl()
```
The 3 extract-transform chains run in parallel. load() waits for all 3. notify() runs after load.
Q: How would you set up Airflow to trigger a Power BI dataset refresh after a pipeline completes?
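A sketch of the token and refresh calls using only stdlib urllib (endpoint paths follow the Power BI REST API; tenant_id, client_id, client_secret, and dataset_id are placeholders you would load from an Airflow Connection, not hard-code):

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
REFRESH_URL = "https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/refreshes"

def get_token(tenant_id: str, client_id: str, client_secret: str) -> str:
    """Client-credentials flow against Azure AD for the Power BI scope."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://analysis.windows.net/powerbi/api/.default",
    }).encode()
    req = urllib.request.Request(TOKEN_URL.format(tenant_id=tenant_id), data=body)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["access_token"]

def trigger_refresh(dataset_id: str, token: str) -> None:
    """POST to the refreshes endpoint; Power BI answers 202 Accepted when queued."""
    req = urllib.request.Request(
        REFRESH_URL.format(dataset_id=dataset_id),
        data=b"{}",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=30)
```

Inside the DAG these calls would run in a PythonOperator or @task, with credentials fetched from the Airflow Connection at runtime.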
Use SimpleHttpOperator or a PythonOperator to call the Power BI REST API. Steps: (1) Register an Azure AD app with Power BI permissions. (2) Store credentials in an Airflow Connection. (3) Get an OAuth token. (4) POST to datasets/{id}/refreshes. (5) Optionally add a sensor to poll the refresh status until complete.

Q: Your DAG processes data for 50 countries. Some fail occasionally. How do you ensure failed countries are retried without reprocessing successful ones?
Use Dynamic Task Mapping: process.expand(country=get_countries()). Each country becomes its own task instance. Failed ones can be retried independently via the UI ("Clear" just the failed instance). Set retries=2 with retry_delay for automatic retries. The successful countries' results are unaffected.

Q: How do you monitor Airflow in production?
(1) Callbacks: on_failure_callback on tasks/DAGs to send Slack/email/PagerDuty alerts. (2) SLAs: sla parameter on tasks to alert when execution exceeds expected duration. (3) Metrics: Airflow exports StatsD/Prometheus metrics (scheduler heartbeat, task duration, pool usage). (4) Health check: /health endpoint. (5) Log aggregation: ship logs to ELK/CloudWatch/Datadog.