Core Concepts of dbt

TL;DR

dbt has 6 building blocks: Models (SQL files → tables/views), ref() & source() (dependency graph), Tests (data quality), Materializations (how models are built), Macros (reusable Jinja SQL), and Documentation (auto-generated docs site).

The Big Picture

dbt core concepts: Models are SQL files, ref/source build the DAG, tests check quality, materializations control how models are built, macros reuse logic, docs describe everything

Explain Like I'm 12

Models = recipes (each one makes a dish). ref() = "use the output of that other recipe" (mashed potatoes need the boiled potatoes). Tests = taste-testing at each step. Materializations = do you serve it fresh every time (view) or make a big batch and store it (table)? Macros = that one trick your grandma taught you that works in every recipe. Docs = the cookbook that describes every recipe.

Cheat Sheet

Concept	What It Does	CLI Command
Models	SQL SELECT → table or view in warehouse	`dbt run`
ref() / source()	Reference models / raw tables, builds DAG	(used in SQL)
Tests	Assert data quality (unique, not_null, etc.)	`dbt test`
Materializations	table, view, incremental, ephemeral	(configured in YAML/SQL)
Macros	Reusable Jinja SQL functions	(called in models)
Docs	Auto-generated documentation site	`dbt docs generate && dbt docs serve`

The 6 Building Blocks

Models — SQL Files That Become Tables

A model is a .sql file containing a single SELECT statement. When you run dbt run, dbt wraps each model in DDL (CREATE TABLE/VIEW) and executes it in the warehouse.

-- models/staging/stg_orders.sql
SELECT
    id AS order_id,
    user_id AS customer_id,
    order_date,
    status,
    amount
FROM {{ source('raw', 'orders') }}
WHERE status != 'deleted'

You write the SELECT. dbt writes the CREATE TABLE AS, handles schema names, and runs models in dependency order.

Project structure: Models live in the models/ directory. Organize by layer: models/staging/, models/intermediate/, models/marts/. Each directory can have a _schema.yml file for tests and docs.

ref() & source() — The Dependency Graph

ref() references another dbt model. source() references raw tables loaded by your EL tool.

-- models/marts/fct_orders.sql
-- ref() builds the DAG: this model depends on stg_orders and stg_customers
SELECT
    o.order_id,
    o.customer_id,
    c.customer_name,
    o.amount,
    o.order_date
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.customer_id

# models/staging/_sources.yml
sources:
  - name: raw
    database: analytics_raw
    schema: public
    tables:
      - name: orders
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

Never hardcode table names. Always use ref() and source(). They resolve the correct database/schema, enable lineage tracking, and ensure correct execution order.

Tests — Data Quality Guards

dbt has two types of tests:

Generic Tests (YAML-defined)

# models/staging/_schema.yml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id

Singular Tests (custom SQL)

-- tests/assert_positive_amounts.sql
-- Returns rows that FAIL the test (should return 0 rows)
SELECT order_id, amount
FROM {{ ref('fct_orders') }}
WHERE amount < 0

Run tests after every dbt run: dbt build runs models AND tests in order. A failing test on a staging model blocks downstream marts.

Materializations — How Models Are Built

Type	What It Creates	Rebuild Cost	Best For
view	SQL view (no stored data)	Instant	Simple transforms, staging models
table	Physical table (full rebuild)	Slow on large data	End-user models, complex transforms
incremental	Table + only process new rows	Fast (only delta)	Event tables, logs, large fact tables
ephemeral	CTE injected into parent model	None (not materialized)	Reusable logic, avoiding intermediate tables

-- models/marts/fct_events.sql
{{
  config(
    materialized='incremental',
    unique_key='event_id'
  )
}}

SELECT * FROM {{ ref('stg_events') }}

{% if is_incremental() %}
  WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}

Macros & Jinja — Reusable SQL

Jinja is a templating language mixed into dbt SQL. Macros are reusable Jinja functions.

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::numeric(10,2)
{% endmacro %}

-- Usage in a model:
SELECT
    order_id,
    {{ cents_to_dollars('amount_cents') }} AS amount_dollars
FROM {{ ref('stg_orders') }}

dbt packages: Install community macros via packages.yml. Popular: dbt-utils (surrogate keys, pivoting, date spine), dbt-expectations (Great Expectations-style tests), codegen (auto-generate model YAML).

Documentation — Auto-Generated Docs

dbt generates a searchable website from your YAML descriptions with a visual lineage graph.

# models/marts/_schema.yml
models:
  - name: fct_orders
    description: "One row per order with customer details. Grain: order_id."
    columns:
      - name: order_id
        description: "Primary key from source system"
      - name: amount_dollars
        description: "Order total in USD (converted from cents)"

dbt docs generate   # Creates the docs site
dbt docs serve       # Opens it in your browser

Test Yourself

Q: What is the difference between ref() and source()?

ref() references another dbt model (a .sql file you wrote). source() references a raw table loaded by an external tool (Fivetran, Airbyte). Both build the dependency graph, but source() also enables freshness checks.

Q: When would you use an incremental materialization vs a table?

Incremental when the table is large and you can identify new/changed rows (e.g., events with a timestamp). Only new rows are processed, making builds fast. Table when the data is small enough to fully rebuild each run, or when the logic is too complex for incremental.

Q: What does dbt build do?

dbt build runs models AND tests in DAG order. If a test fails on a model, downstream models are skipped. It combines dbt run + dbt test + dbt snapshot + dbt seed into one command.

Interview Questions

Q: Explain the dbt project structure. What goes where?

models/ — SQL models organized by layer (staging/, intermediate/, marts/). tests/ — singular test SQL files. macros/ — reusable Jinja functions. seeds/ — CSV files loaded as tables. snapshots/ — SCD Type 2 definitions. dbt_project.yml — project config (name, version, materializations). profiles.yml — warehouse connection details (local only, not committed).

Q: What is a snapshot in dbt?

Snapshots implement Slowly Changing Dimension Type 2 (SCD2). They track historical changes to rows by capturing dbt_valid_from and dbt_valid_to timestamps. When a row changes, the old version gets an end timestamp and a new version is inserted. Configured with a strategy (timestamp or check) and unique_key.

Q: How does dbt handle incremental models? What happens on the first run?

First run: full table build (like a table materialization). Subsequent runs: only process new/changed rows using the is_incremental() Jinja macro. If the filter logic is wrong, run dbt run --full-refresh to rebuild from scratch. The unique_key config handles upserts (merge/insert-overwrite).