What is Apache Airflow?

TL;DR

Airflow is an open-source workflow orchestrator. You define your data pipelines as Python code (DAGs), Airflow schedules and runs them, and you monitor everything through a web UI. Think of it as a smart cron job on steroids that knows about dependencies.

The Big Picture

Airflow orchestrates work. You tell it what to do, when to do it, and in what order.

Airflow architecture: You write DAGs in Python, the Scheduler triggers tasks, the Executor runs them on Workers, metadata goes to the Database, and the Web UI shows status
Explain Like I'm 12

Imagine you're the manager of a factory assembly line. You have a clipboard with the order of steps: first cut the wood, then sand it, then paint it, then dry it. Some steps can happen at the same time (cutting two boards). Some must wait (can't paint before sanding).

Airflow is that clipboard — but automatic. You write the steps and their order in Python. Airflow runs them on schedule, watches for failures, retries broken steps, and shows you a dashboard of everything that's happening. You never have to stand over the assembly line again.

What is Airflow?

Apache Airflow was created at Airbnb in 2014 and became an Apache top-level project in 2019. It's the most widely used workflow orchestrator in data engineering.

Key characteristics:

  • Workflows as code — DAGs are Python files, so you get version control, code review, and testing for free
  • Dependency-aware — tasks run only when their upstream dependencies succeed
  • Scheduled — cron-like scheduling with catchup for missed runs
  • Extensible — 1000+ pre-built operators for AWS, GCP, Azure, databases, APIs, and more
  • Observable — rich web UI with DAG visualizations, logs, and task history

Who is it for?

Data engineers building ETL/ELT pipelines. ML engineers orchestrating training jobs. Analytics engineers scheduling dbt runs. DevOps teams automating infrastructure tasks. Anyone who has outgrown cron jobs and needs visibility into complex, multi-step workflows.

What can Airflow do?

  • ETL/ELT pipelines — extract data from APIs, transform in Spark/SQL, load to warehouse
  • ML pipelines — train models, validate, deploy to production
  • Data quality checks — run tests on data after each load
  • Report generation — build and email reports on a schedule
  • Infrastructure automation — spin up/down clusters, manage resources
  • Cross-system orchestration — coordinate tasks across Spark, Kubernetes, cloud services, databases

Airflow vs Others

FeatureAirflowPrefectDagsterCron
LanguagePythonPythonPythonShell
UIRich web UICloud UIDagit UINone
DependenciesDAG-basedFlow-basedAsset-basedNone
RetriesBuilt-inBuilt-inBuilt-inManual
EcosystemLargest (1000+ operators)GrowingGrowingN/A
Best forComplex pipelines at scaleModern Python workflowsData asset managementSimple recurring jobs

What you'll learn

Start Learning: Core Concepts →

Test Yourself

What does DAG stand for?

Directed Acyclic Graph — a collection of tasks with dependencies, where the flow goes in one direction with no cycles (a task can't depend on itself, directly or indirectly).

Why is Airflow better than cron for data pipelines?

Airflow provides: dependency management between tasks, built-in retries, a web UI for monitoring, logging, alerting on failures, backfill capabilities, and an ecosystem of pre-built operators. Cron just runs commands on a schedule with no awareness of dependencies or failures.

What language are Airflow DAGs written in?

Python. DAGs are regular Python files, which means you can use loops, conditionals, imports, and any Python library to generate your workflows dynamically.