dbt Interview Questions

TL;DR

30+ dbt interview questions organized by topic. Models & Materializations and Testing come up in every analytics engineering interview.

Short on time? Focus on Models & Materializations and Scenario Questions.

Fundamentals

Q: What is dbt and how does it fit in the modern data stack?

dbt is the transformation layer in ELT. Sources (Fivetran/Airbyte) extract and load raw data into the warehouse. dbt transforms it inside the warehouse using SQL. BI tools (Power BI, Looker) consume the clean output. dbt only does the "T" — no extraction or loading.

Q: What is the difference between ref() and source()?

ref('model_name') references another dbt model. source('source_name', 'table_name') references a raw table from an external loader. Both build the DAG. source() also enables freshness monitoring. Never hardcode table names.
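A minimal sketch of the two side by side (the source and model names here, like jaffle_shop and stg_orders, are placeholders):

```yaml
# models/staging/sources.yml
version: 2
sources:
  - name: jaffle_shop
    schema: raw
    tables:
      - name: orders
```

```sql
-- models/staging/stg_orders.sql: raw table via source()
select * from {{ source('jaffle_shop', 'orders') }}

-- models/marts/fct_orders.sql: another dbt model via ref()
select * from {{ ref('stg_orders') }}
```

Because both functions resolve at compile time, dbt can infer the full dependency graph and run models in the right order.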

Q: What is dbt Core vs dbt Cloud?

Core: free OSS CLI, run anywhere (locally, CI/CD, Airflow). Cloud: managed SaaS with web IDE, scheduling, CI on PRs, docs hosting, semantic layer, team collaboration. Many teams use Core in Airflow for orchestration, or Cloud for end-to-end.

Q: What are seeds in dbt?

Seeds are CSV files in the seeds/ directory that dbt loads as tables. Use for small, static lookup data: country codes, status mappings, category hierarchies. Run with dbt seed. Not for large datasets — those should come through your EL tool.
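A quick sketch (file and column names are illustrative):

```sql
-- seeds/country_codes.csv contains, e.g.:
--   country_code,country_name
--   US,United States
--   DE,Germany

-- After `dbt seed`, reference it like any model:
select o.*, c.country_name
from {{ ref('stg_orders') }} o
left join {{ ref('country_codes') }} c using (country_code)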

Models & Materializations

Q: Explain the 4 materialization types. When do you use each?

view: SQL view, no stored data. Fast build, slow query. Use for staging. table: full table rebuild each run. Slow build, fast query. Use for marts. incremental: only process new/changed rows. Fast build + fast query. Use for large event tables. ephemeral: CTE injected into parent query, not materialized. Use to avoid intermediate tables.
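Materializations can be set per model or per folder. A sketch, assuming a project named my_project:

```sql
-- In a single model file: per-model override
{{ config(materialized='table') }}
select ...
```

```yaml
# dbt_project.yml: folder-level defaults
models:
  my_project:
    staging:
      +materialized: view
    marts:
      +materialized: table
```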

Q: How do incremental models work? What happens on the first run?

First run: full table build. Subsequent runs: the is_incremental() block filters to only new/changed rows. The unique_key config determines upsert behavior (MERGE or DELETE+INSERT). Use --full-refresh to force a complete rebuild when logic changes.
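A minimal incremental model sketch (table and column names are assumptions):

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
) }}

select * from {{ source('events', 'raw_events') }}

{% if is_incremental() %}
  -- only rows newer than what's already in this table ({{ this }})
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```

On the first run (or with --full-refresh) the `is_incremental()` block is skipped and the whole table is built.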

Q: What are the 3 model layers and their naming conventions?

Staging (stg_{source}__{table}): 1:1 with source, rename/cast/clean, materialized as view. Intermediate (int_{entity}__{verb}): joins/transforms, ephemeral or view. Marts (fct_ / dim_): business-ready, materialized as table. Marts organized by domain (finance/, marketing/).

Q: What is an ephemeral model and when would you use one?

An ephemeral model isn't materialized in the warehouse. Instead, its SQL is injected as a CTE into whatever model references it. Use when you want to share logic between models without creating intermediate tables. Good for simple transformations referenced by one or two downstream models.

Deeper coverage: Model Layers Deep Dive

Testing & Quality

Q: What types of tests does dbt support?

Generic tests: YAML-defined (unique, not_null, accepted_values, relationships). Singular tests: custom SQL queries in the tests/ directory that return rows that fail. Custom generic tests: Jinja macros that can be applied like built-in tests. dbt packages: dbt-expectations adds 50+ tests (e.g., expect_column_values_to_be_between).
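A sketch of both kinds, assuming hypothetical fct_orders/dim_customers models:

```yaml
# models/marts/schema.yml — generic tests
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```

```sql
-- tests/no_negative_revenue.sql — singular test: any rows returned = failure
select * from {{ ref('fct_orders') }} where revenue < 0
```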

Q: How do you implement data quality checks in a dbt pipeline?

(1) Generic tests on every model (unique, not_null on PKs). (2) Singular tests for business rules (no negative revenue). (3) Source freshness checks (dbt source freshness). (4) dbt build runs tests after each model — failures block downstream. (5) dbt-expectations for statistical tests. (6) Elementary or re_data for anomaly detection.

Q: What is source freshness and how do you configure it?

Source freshness checks if raw data is being loaded on time. Configure in YAML with loaded_at_field (timestamp column) and freshness thresholds (warn_after, error_after). Run dbt source freshness to check. If the latest row is older than the threshold, it warns/errors. Essential for catching silent EL failures.
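A freshness config sketch (source and column names are placeholders):

```yaml
sources:
  - name: jaffle_shop
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
```

Table-level settings can override the source-level defaults, which is useful when one table loads hourly and another daily.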

Advanced Concepts

Q: What is a snapshot in dbt and when would you use one?

Snapshots capture historical changes (SCD Type 2). They add dbt_valid_from/dbt_valid_to columns. When a row changes, the old version gets an end date, and a new version is inserted. Use for tracking changes to dimensions over time (customer tier changes, product price history). Strategy: timestamp (needs an updated_at column) or check (compares column values).
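A timestamp-strategy snapshot sketch (schema, key, and source names are assumptions):

```sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}
select * from {{ source('jaffle_shop', 'customers') }}
{% endsnapshot %}
```

Run with dbt snapshot; dbt compares incoming rows to the stored versions and closes out changed records.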

Q: How do Jinja macros work in dbt? Give an example.

Macros are reusable Jinja functions. Defined in macros/*.sql, called in models.
```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(col) %}
  ({{ col }} / 100.0)::numeric(10,2)
{% endmacro %}

-- In a model:
SELECT {{ cents_to_dollars('amount_cents') }} AS amount
```
Also useful for environment-specific logic: {% if target.name == 'prod' %}.

Q: How do you set up CI/CD for dbt?

dbt Cloud: Slim CI — on every PR, only run modified models + downstream against a temp schema. Auto-comment results on the PR. dbt Core: GitHub Actions / GitLab CI pipeline: (1) dbt deps, (2) dbt build --select state:modified+ with a manifest from prod, (3) fail the PR if tests fail. Both approaches ensure untested code never reaches production.
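The dbt Core flow might look like this in GitHub Actions (a sketch: the adapter, artifact path, and credential setup are assumptions for your environment):

```yaml
# .github/workflows/dbt_ci.yml
name: dbt CI
on: pull_request
jobs:
  dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake  # swap for your adapter
      - run: dbt deps
      # state:modified+ needs a manifest.json from the last prod run
      - run: dbt build --select state:modified+ --defer --state ./prod-artifacts
```

The --defer flag lets unmodified upstream models resolve to their production relations, so CI only builds what the PR changed.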

Q: What is the dbt semantic layer?

The semantic layer (dbt Cloud) defines metrics and dimensions in YAML using MetricFlow. Instead of every BI tool writing its own "total revenue" SQL, the metric is defined once in dbt and queried consistently via API. Prevents metric sprawl (5 different definitions of "revenue" across tools). Supports time grains, filters, and joins natively.
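A hedged MetricFlow sketch — the metric and measure names are illustrative, and the measure must already be defined in a semantic model:

```yaml
metrics:
  - name: total_revenue
    label: Total Revenue
    type: simple
    type_params:
      measure: revenue_amount
```

BI tools then query total_revenue by any dimension or time grain through the semantic layer API instead of re-deriving it in SQL.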

Scenario Questions

Q: Your incremental model is producing duplicate rows. How do you debug and fix it?

(1) Check if unique_key is set — without it, dbt appends instead of upserting. (2) Verify the incremental filter logic — is the is_incremental() block correctly filtering to only new rows? (3) Check for late-arriving data that falls outside the filter window. (4) Run --full-refresh to rebuild cleanly. (5) Add a unique test on the primary key to catch duplicates early. (6) Consider using incremental_strategy='merge' with a proper unique_key.
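One common fix for the late-arriving-data case is a lookback window, so recent rows are re-merged rather than duplicated. A sketch (dateadd syntax here is Snowflake-style; column names are assumptions):

```sql
{% if is_incremental() %}
  -- 3-day lookback: late-arriving rows get re-merged via unique_key
  where event_timestamp > (
    select dateadd(day, -3, max(event_timestamp)) from {{ this }}
  )
{% endif %}
```

This only deduplicates correctly when unique_key and a merge strategy are configured; with the default append behavior, the lookback would re-insert rows.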

Q: How would you orchestrate dbt in production with Airflow?

Common pattern: (1) Airflow DAG runs on schedule. (2) First tasks: wait for source data (sensors). (3) Run dbt build (or split into dbt run + dbt test). (4) Use cosmos (astronomer-cosmos) to parse the dbt DAG and create one Airflow task per dbt model for fine-grained monitoring. (5) On failure: alert, don't retry dbt models automatically (they're idempotent but may mask data issues).

Q: You inherit a dbt project with 200 models and no tests. How do you add data quality?

(1) Start with the highest-value models (marts used by executives). (2) Add unique + not_null tests on all primary keys — use codegen package to auto-generate YAML. (3) Add relationships tests between staging and marts. (4) Add source freshness checks. (5) Add accepted_values on status/type columns. (6) Incrementally add singular tests for known business rules. (7) Integrate dbt build into CI to prevent regressions.

Q: A daily dbt run takes 3 hours. How would you speed it up?

(1) Find the slow models first — check model timing in run_results.json (or dbt Cloud's model timing view) and the warehouse query history. (2) Convert large full-table rebuilds to incremental. (3) Materialize staging as views (they're instant to build). (4) Use dbt run -s tag:critical for priority models. (5) Parallelize — increase threads in profiles.yml. (6) Optimize slow SQL (check EXPLAIN, add clustering keys). (7) Consider splitting into multiple runs (staging hourly, marts daily).