Databricks Interview Questions

TL;DR

30+ Databricks interview questions organized by topic, each with a concise answer. Perfect for a 15-minute revision before an interview.

Short on time? Focus on Lakehouse Architecture, Delta Lake, and Spark SQL — these come up in almost every Databricks interview.

Lakehouse Architecture

Q: What is a lakehouse and how does it differ from a data lake or data warehouse?

A data warehouse stores structured data in a proprietary format and excels at BI queries but is expensive and rigid. A data lake stores raw data cheaply in open formats (Parquet, JSON) but lacks ACID transactions, schema enforcement, and governance — leading to the "data swamp" problem. A lakehouse combines both: it stores data in open formats on cheap object storage (like a lake) but adds a transactional metadata layer (Delta Lake) that gives you ACID transactions, schema enforcement, and BI-grade performance (like a warehouse). Databricks coined the term and implements it via Delta Lake on top of cloud storage.

Q: Explain the Medallion architecture (Bronze / Silver / Gold).

The Medallion architecture organizes data into three progressive quality layers. Bronze is the raw ingestion layer — data lands here as-is from sources (CDC feeds, API responses, CSVs) with minimal transformation, preserving the full history. Silver is the cleaned and conformed layer — data is deduplicated, schema-validated, joined, and enriched. Gold is the business-level aggregation layer — purpose-built tables optimized for BI dashboards and reporting (e.g., daily_revenue_by_region). Each layer is a Delta table, so you get ACID guarantees and time travel at every stage.
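As a mental model, the three layers can be sketched in plain Python (a toy simulation, not Databricks code — in practice each layer is a Delta table written with Spark):

```python
# Bronze: raw records land as-is, duplicates and bad rows included
bronze = [
    {"order_id": 1, "region": "US", "amount": 100},
    {"order_id": 1, "region": "US", "amount": 100},   # duplicate from a replayed feed
    {"order_id": 2, "region": "EU", "amount": None},  # invalid record
    {"order_id": 3, "region": "EU", "amount": 50},
]

# Silver: deduplicate on the business key and drop invalid rows
seen, silver = set(), []
for row in bronze:
    if row["order_id"] not in seen and row["amount"] is not None:
        seen.add(row["order_id"])
        silver.append(row)

# Gold: business-level aggregation (revenue by region)
gold = {}
for row in silver:
    gold[row["region"]] = gold.get(row["region"], 0) + row["amount"]

print(gold)  # {'US': 100, 'EU': 50}
```

The point of the layering is exactly this separation: raw history survives in bronze, cleaning happens once in silver, and gold stays small and query-friendly.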

Q: What are the advantages of open file formats like Delta and Parquet?

No vendor lock-in — any engine (Spark, Trino, DuckDB, Polars) can read Parquet and Delta. Columnar storage — queries that read a subset of columns skip irrelevant data, massively reducing I/O. Efficient compression — columnar layout enables dictionary, run-length, and Snappy/Zstd encoding, often achieving 5-10x compression vs. CSV. Schema metadata is embedded in the file footer, so readers know column types without an external catalog. Delta adds a transaction log on top of Parquet, giving you ACID transactions and time travel that plain Parquet lacks.

Q: How does Databricks handle schema enforcement vs. schema evolution?

Schema enforcement (schema on write) rejects any write that doesn't match the table's existing schema — for example, writing a string into an integer column or adding an unexpected column. This is the default behavior for Delta tables and prevents silent data corruption. Schema evolution allows the schema to change over time. You enable it with .option("mergeSchema", "true") on a write or ALTER TABLE ... ADD COLUMNS. New columns are added automatically and backfilled with nulls for existing rows. You can also use .option("overwriteSchema", "true") for a complete schema replacement, but this is destructive.
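The two behaviors can be sketched in plain Python (the write helper below is a hypothetical toy, not the Delta API):

```python
# Toy model: enforcement rejects unknown columns; evolution adds them
# and backfills existing rows with None (like mergeSchema=true).

def write(table_schema, rows, merge_schema=False):
    incoming = set().union(*(r.keys() for r in rows))
    new_cols = incoming - set(table_schema)
    if new_cols and not merge_schema:
        raise ValueError(f"Schema enforcement: unexpected columns {sorted(new_cols)}")
    schema = list(table_schema) + sorted(new_cols)   # schema evolution
    # every row gets every column; missing values become None
    return schema, [{c: r.get(c) for c in schema} for r in rows]

schema = ["id", "name"]
rows = [{"id": 1, "name": "a", "email": "a@x.com"}]

try:
    write(schema, rows)                  # default: the write is rejected
except ValueError as e:
    print(e)

schema, data = write(schema, rows, merge_schema=True)  # evolution enabled
print(schema)  # ['id', 'name', 'email']
```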

Q: What is the difference between managed and external tables?

A managed table has both its metadata and data files controlled by Databricks (or Unity Catalog). When you DROP a managed table, the underlying data files are deleted too. An external table points to data at a location you specify (e.g., s3://my-bucket/sales/). Dropping an external table removes the metadata from the catalog, but the data files remain untouched. Use managed tables when Databricks is the single owner of the data; use external tables when data is shared across multiple engines or you need to control the storage location.

Q: How does Databricks SQL differ from a Spark cluster for running queries?

Databricks SQL (formerly SQL Analytics) provides serverless SQL warehouses optimized purely for BI and ad-hoc queries. They start faster, auto-scale horizontally, support caching, and use the Photon engine by default for vectorized execution. All-Purpose clusters run general Spark workloads (Python, Scala, R, SQL) and are better suited for ETL notebooks, ML training, and interactive development. SQL warehouses are typically cheaper for pure SQL workloads because they can suspend to zero and avoid over-provisioning.
Go deeper: See Databricks Overview for a full picture of the lakehouse platform and its components.

Delta Lake

Q: What is Delta Lake and how does it differ from plain Parquet?

Delta Lake is an open-source storage layer that brings ACID transactions to data lakes. Underneath, data is still stored as Parquet files, but Delta adds a transaction log (_delta_log/) that tracks every change. This gives you features plain Parquet can't provide: atomic writes, consistent reads (snapshot isolation), time travel, schema enforcement, and the ability to do UPDATE, DELETE, and MERGE operations. Plain Parquet is just a file format — Delta Lake is a file format plus a protocol.

Q: Explain ACID transactions in the context of Delta Lake.

Atomicity — a multi-file write either fully succeeds or fully rolls back; no partial writes. Consistency — schema enforcement ensures data matches the expected types and columns. Isolation — readers see a consistent snapshot via optimistic concurrency control; concurrent writes use the transaction log to detect conflicts. Durability — once a commit is recorded in the _delta_log, the data is persisted to cloud object storage and survives failures. This is a huge improvement over plain Parquet, where a failed Spark job could leave orphan files and corrupt readers.

Q: What is the transaction log (_delta_log) and how does it work?

The _delta_log/ directory sits alongside the Parquet data files. It contains ordered JSON files (000000.json, 000001.json, ...) — each recording a commit: which files were added, removed, or modified, plus metadata changes. Every 10 commits, a checkpoint Parquet file is created that summarizes the current state, so readers don't need to replay all JSON files from the beginning. When a reader opens a Delta table, it reads the latest checkpoint + any subsequent JSON commits to reconstruct the current list of valid data files. This log is the backbone of time travel, schema enforcement, and ACID semantics.
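The replay logic can be sketched in a few lines of plain Python (a toy model — real commits are JSON action files in _delta_log/, and checkpoints avoid replaying from version 0):

```python
# Each commit is a list of add/remove actions; the current table is
# whatever set of files is still "live" after replaying the log.

commits = [
    [("add", "part-000.parquet"), ("add", "part-001.parquet")],     # v0: initial write
    [("add", "part-002.parquet")],                                  # v1: append
    [("remove", "part-000.parquet"), ("add", "part-003.parquet")],  # v2: UPDATE rewrote a file
]

def snapshot(commits, version):
    """Reconstruct the live file set as of a given version (time travel)."""
    live = set()
    for commit in commits[: version + 1]:
        for action, path in commit:
            live.add(path) if action == "add" else live.discard(path)
    return live

print(sorted(snapshot(commits, 2)))  # current version: parts 001, 002, 003
print(sorted(snapshot(commits, 0)))  # time travel to v0: parts 000, 001
```

Note that a "remove" only drops the file from the snapshot — the Parquet file itself stays on disk until VACUUM, which is what makes time travel possible.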

Q: Explain time travel in Delta Lake. How would you query a previous version?

Because the transaction log records every commit, you can query any previous version of a table. Two ways:
-- By version number
SELECT * FROM my_table VERSION AS OF 5;

-- By timestamp
SELECT * FROM my_table TIMESTAMP AS OF '2026-03-01';
Use cases include auditing (what did this table look like last Tuesday?), debugging (compare current vs. previous version to find bad data), and disaster recovery (restore a table after an accidental DELETE with RESTORE TABLE my_table TO VERSION AS OF 5). Time travel is bounded by two retention settings: commit history is kept for delta.logRetentionDuration (default 30 days), but VACUUM removes unreferenced data files older than delta.deletedFileRetentionDuration (default 7 days) — so in practice you can only travel back as far as the oldest data files that still exist.

Q: What does MERGE INTO do? Write a basic MERGE statement.

MERGE INTO performs upserts — insert new rows and update existing ones in a single atomic operation. It matches rows from a source against a target using a join condition, then defines what to do when rows match vs. don't match.
MERGE INTO target_table AS t
USING source_table AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.name = s.name, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, name, updated_at)
  VALUES (s.id, s.name, s.updated_at);
This is the workhorse of CDC (Change Data Capture) pipelines. Without Delta Lake, you'd need to rewrite entire partitions or use complex deduplication logic.

Q: What is VACUUM and what happens if you set retention too low?

VACUUM deletes data files that are no longer referenced by the transaction log (i.e., files from old versions). By default it retains files for 7 days.
-- Remove files older than 7 days (default)
VACUUM my_table;

-- Remove files older than 24 hours (risky)
VACUUM my_table RETAIN 24 HOURS;
If you set retention too low, you break time travel — queries referencing old versions will fail with "file not found" errors. You also risk breaking long-running queries that started before the vacuum ran. Databricks prevents retention below 7 days unless you explicitly set spark.databricks.delta.retentionDurationCheck.enabled = false. Best practice: keep the default 7-day retention unless storage costs are a major concern.

Q: What is Z-ORDER and when would you use it?

Z-ORDER is a multi-dimensional clustering technique that co-locates related data within the same set of files. When you run:
OPTIMIZE my_table ZORDER BY (country, date);
Delta rearranges data so that rows with the same country and date values are stored close together. Queries that filter on these columns can skip entire files via data skipping (min/max statistics in the transaction log). Use Z-ORDER on columns you frequently filter by — typically high-cardinality columns used in WHERE clauses. Don't Z-ORDER on too many columns (2-4 max) as effectiveness decreases. Z-ORDER is being superseded by Liquid Clustering in newer Databricks runtimes, which handles clustering automatically.
Go deeper: See Delta Lake Deep Dive for transaction log internals, OPTIMIZE strategies, and production patterns.

Spark SQL

Q: What's the difference between a temporary view and a global temporary view?

A temporary view (CREATE TEMP VIEW) is scoped to the current SparkSession (i.e., the current notebook or job). Other notebooks can't see it. A global temporary view (CREATE GLOBAL TEMP VIEW) is scoped to the Spark application and visible across all SparkSessions on the same cluster — accessed via the global_temp database (e.g., SELECT * FROM global_temp.my_view). A temp view lasts until its session ends; a global temp view lasts until the Spark application terminates (e.g., when the cluster restarts). In practice, most people use regular temp views; global temp views are rare and mostly useful for sharing data between notebooks in the same cluster session.

Q: Explain window functions. Write a query using ROW_NUMBER().

Window functions perform calculations across a set of rows related to the current row — without collapsing rows like GROUP BY does. You define a window with OVER(PARTITION BY ... ORDER BY ...).
SELECT
  customer_id,
  order_date,
  amount,
  ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY order_date DESC
  ) AS rn
FROM orders;
This assigns rn = 1 to the most recent order per customer. Common pattern: wrap it in a CTE and filter WHERE rn = 1 to get the latest row per group. Other useful window functions: RANK(), DENSE_RANK(), LAG(), LEAD(), and SUM() OVER (...) for running totals.

Q: What is the Catalyst optimizer?

Catalyst is Spark SQL's query optimizer. It takes your logical plan (what you asked for) and produces an optimized physical plan (how to execute it) through four phases: (1) Analysis — resolves column names and types using the catalog. (2) Logical optimization — applies rule-based rewrites like predicate pushdown, constant folding, and projection pruning. (3) Physical planning — selects join strategies (broadcast hash join vs. sort-merge join) based on table sizes. (4) Code generation — compiles the physical plan into optimized Java bytecode (whole-stage codegen). Catalyst is why writing df.filter().select() produces the same plan as writing the equivalent SQL — both go through the same optimizer.
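The flavor of phase 2's rule-based rewrites can be shown with a toy predicate-pushdown pass in plain Python (a sketch only — Catalyst operates on typed expression trees, not tuples):

```python
# Plan nodes as tuples:
#   ("filter", predicate_col, child) / ("project", cols, child) / ("scan", table)
plan = ("filter", "country", ("project", ["country", "amount"], ("scan", "orders")))

def push_down_filter(node):
    """Push a Filter below a Project so rows are dropped before columns are shaped."""
    if node[0] == "filter" and node[2][0] == "project":
        _, pred, (_, cols, child) = node
        if pred in cols:  # safe to push: the filter only needs projected columns
            return ("project", cols, ("filter", pred, push_down_filter(child)))
    return node

optimized = push_down_filter(plan)
print(optimized)
# ('project', ['country', 'amount'], ('filter', 'country', ('scan', 'orders')))
```

Real rules like predicate pushdown go further — pushing the filter all the way into the Parquet/Delta scan so unmatched files are never read.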

Q: What's the difference between Spark SQL and traditional SQL?

Spark SQL looks like traditional SQL but runs on a distributed engine. Key differences: (1) Data is partitioned across a cluster — a JOIN may trigger a shuffle (data movement across nodes). (2) There's no traditional index — performance comes from partitioning, Z-ORDER, and data skipping. (3) Spark SQL supports semi-structured data natively (structs, arrays, maps) — you can do col.field or explode(array_col). (4) It processes data in batches, not row-by-row. (5) It integrates with Python/Scala DataFrames — you can mix SQL and code freely. The syntax is ANSI SQL-compatible, so most queries transfer directly, but tuning is fundamentally different.

Q: How do you handle semi-structured (JSON) data in Spark SQL?

Spark SQL handles JSON natively. You can read JSON files with schema inference, then query nested fields with dot notation:
-- Read JSON into a table
CREATE TABLE events USING json
OPTIONS (path 's3://bucket/events/');

-- Query nested fields
SELECT
  payload.user.name,
  payload.event_type,
  explode(payload.items) AS item
FROM events
WHERE payload.event_type = 'purchase';
Key functions: from_json() to parse a JSON string column into a struct, to_json() for the reverse, explode() to flatten arrays, get_json_object() for quick extraction via JSONPath, and the : operator for variant types in Databricks (e.g., raw_col:key::string).

Q: What is the Photon engine?

Photon is Databricks' proprietary native C++ vectorized execution engine that replaces Spark's JVM-based execution for supported operations. It processes data in columnar batches using SIMD instructions, which is dramatically faster for scan-heavy, filter-heavy, and aggregation queries. Photon is enabled by default on Databricks SQL warehouses and can be toggled on for all-purpose clusters. It's particularly effective for Delta Lake workloads because it understands the Delta format natively. Not all operations are Photon-accelerated — unsupported operations fall back to standard Spark execution transparently. Typical speedups range from 2x to 8x for analytical queries.
Go deeper: See Spark SQL on Databricks for window functions, CTEs, Photon tuning, and real-world query patterns.

Unity Catalog & Governance

Q: What is Unity Catalog and what problems does it solve?

Unity Catalog is Databricks' unified governance layer for all data and AI assets. Before Unity Catalog, each workspace had its own Hive metastore, so sharing data between workspaces required manual grants and duplicate metadata. Unity Catalog solves this by providing: (1) Centralized access control — one place to manage permissions across all workspaces. (2) Three-level namespace (catalog.schema.table) for organizing data. (3) Data lineage — automatic tracking of column-level lineage. (4) Audit logging — who accessed what, when. (5) Data sharing via Delta Sharing (open protocol, works across platforms). It governs tables, views, volumes (files), models, and functions.

Q: Explain the 3-level namespace in Unity Catalog.

Unity Catalog uses a three-level namespace: catalog.schema.object. A catalog is the top-level container (think of it as a database in traditional RDBMS terms) — often one per environment or business domain (e.g., production, development, finance). A schema (database) is a grouping within a catalog (e.g., production.sales). Objects are tables, views, volumes, functions, and models within a schema. Example: production.sales.orders. This replaces the old two-level database.table pattern and lets you organize and govern data hierarchically. Permissions are hierarchical: USE CATALOG on production is the prerequisite for reaching anything inside it, but users still need USE SCHEMA and SELECT on the objects themselves.

Q: How does data lineage work in Unity Catalog?

Unity Catalog automatically captures column-level lineage whenever a query reads from or writes to a table. It tracks: which tables and columns were used as inputs, which outputs were produced, and which notebook/job/pipeline performed the transformation. You can view lineage in the Databricks UI as an interactive graph. This is captured without any additional code — the query engine logs lineage metadata as part of execution. Use cases: impact analysis (if I change this column, what downstream tables are affected?), debugging (where did this bad data originate?), and compliance (proving data provenance for auditors). Lineage covers Spark SQL, PySpark DataFrames, and Delta Live Tables.

Q: What is row-level security and column masking in Databricks?

Row-level security (row filters) restricts which rows a user can see based on their identity. You create a SQL function that returns a boolean, then apply it:
CREATE FUNCTION region_filter(region STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('admins'), true, region = 'US');
-- admins see every row; everyone else sees only US rows

ALTER TABLE sales SET ROW FILTER region_filter ON (region);
Column masking hides or redacts sensitive column values. You define a masking function and apply it to a column:
CREATE FUNCTION mask_email(email STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('pii_users'), email, '***@***.com');

ALTER TABLE customers ALTER COLUMN email SET MASK mask_email;
Both are enforced at the query engine level — they apply regardless of whether the user accesses data via SQL, PySpark, or BI tools. This is a Unity Catalog feature.

Q: What is Delta Sharing and how does it work?

Delta Sharing is an open protocol for secure, real-time data sharing across organizations and platforms. The data provider creates a share (a named collection of tables) and grants access to a recipient. The recipient gets a credentials file and can read shared data using any compatible client — Databricks, Apache Spark, pandas, Power BI, or any REST client. Key benefits: (1) No data copying — the recipient reads directly from the provider's storage. (2) Open protocol — not locked to Databricks. (3) Fine-grained access — share specific tables, partitions, or filtered views. (4) Audit trail — the provider can track who accessed what. It works across cloud providers (AWS, Azure, GCP).

Clusters & Performance

Q: What types of clusters does Databricks offer?

Databricks offers three main compute types: (1) All-Purpose clusters — interactive, multi-user clusters for notebooks and development. Support Python, SQL, Scala, R. Can be shared or single-user. (2) Job clusters — ephemeral clusters that spin up for a specific job and terminate when it finishes. Cheaper because they don't sit idle. Always use these for production pipelines. (3) SQL Warehouses — optimized for SQL workloads, BI tool connections, and ad-hoc queries. Available in Classic, Pro, and Serverless tiers. Serverless SQL warehouses start in seconds and auto-scale without infrastructure management. Additionally, Instance Pools pre-allocate VMs to reduce cluster startup time.

Q: What is autoscaling and when should you use it?

Autoscaling automatically adjusts the number of worker nodes based on workload. You configure a min and max worker count. When Spark detects pending tasks, it adds workers; when idle, it removes them. Use it for: interactive/ad-hoc workloads with unpredictable volume, SQL warehouses serving BI tools, and pipelines with varying stage sizes. Avoid it for: latency-sensitive streaming jobs (scaling takes minutes), short batch jobs (overhead of scaling exceeds the runtime), and workloads with consistent resource needs. For production batch jobs, a fixed-size cluster is often more cost-efficient and predictable. Also be aware of optimized autoscaling on Databricks, which scales down more aggressively than standard Spark autoscaling.

Q: How do you troubleshoot a slow Spark job?

Systematic approach: (1) Check the Spark UI — look at the Stages tab for stages with high task counts or long durations. (2) Identify shuffles — look for large shuffle read/write in the stage details. Shuffles are the #1 performance killer. (3) Check for data skew — in the stage summary, compare the median task duration to the max. If max is 10x+ the median, you have skew. (4) Review the physical plan — run EXPLAIN EXTENDED and look for SortMergeJoin (expensive) vs. BroadcastHashJoin (fast). (5) Check for spill — "Spill to disk" in the Spark UI means executors ran out of memory. (6) Examine the data — small files, too many partitions, or uncompacted Delta tables all hurt performance. (7) Run OPTIMIZE and ZORDER on frequently queried Delta tables.

Q: What is data skew and how do you fix it?

Data skew happens when data is unevenly distributed across partitions, causing a few tasks to process far more data than others. The job is only as fast as the slowest task, so one skewed partition can bottleneck everything. Fixes: (1) Salting — add a random prefix to the skewed key, perform the join/aggregation, then remove it. (2) Broadcast join — if one side of the join is small (<10MB default), broadcast it to avoid shuffling. (3) Adaptive Query Execution (AQE) — enable spark.sql.adaptive.enabled = true (default in Databricks). AQE can detect skew at runtime and split skewed partitions automatically. (4) Repartition — explicitly repartition by a more evenly distributed column before the join. (5) Filter out the skewed value (often NULL) and process it separately.
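Salting can be illustrated with a toy two-stage aggregation in plain Python (a simulation only — in Spark, each salted key would land in a different shuffle partition, so no single task processes all of the hot key):

```python
import random
random.seed(0)

N_SALTS = 4
rows = [("US", 1)] * 1000 + [("NZ", 1)] * 10   # "US" is the skewed key

# Stage 1: aggregate by (key, salt) — the hot key is spread across
# up to N_SALTS sub-keys, so work is distributed
partials = {}
for key, val in rows:
    salted = (key, random.randrange(N_SALTS))
    partials[salted] = partials.get(salted, 0) + val

# Stage 2: strip the salt and combine the (small) partial results
totals = {}
for (key, _salt), val in partials.items():
    totals[key] = totals.get(key, 0) + val

print(totals)  # {'US': 1000, 'NZ': 10} — same answer, better distribution
```

The second stage still aggregates by the original key, but it only sees at most N_SALTS partial rows per key, so it's cheap even for the hottest key.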

Q: Explain the difference between narrow and wide transformations.

Narrow transformations (e.g., filter, select, map, WHERE) process data within a single partition — no data movement between nodes. They're cheap and pipeline together. Wide transformations (e.g., groupBy, JOIN, DISTINCT, ORDER BY, repartition) require data to be shuffled across the network to different partitions. Each wide transformation creates a stage boundary in the Spark execution plan. Shuffles write intermediate data to disk and transfer it over the network, making them the most expensive operation in Spark. Optimizing Spark jobs is largely about minimizing unnecessary shuffles.
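The distinction can be simulated in plain Python (a toy model — the "shuffle" here is just moving tuples between lists, but in Spark it means disk writes and network transfer):

```python
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow (filter): each partition is processed independently, no data movement
filtered = [[(k, v) for k, v in part if v > 1] for part in partitions]

# Wide (groupBy key): rows with the same key may live in different
# partitions, so they must be co-located before aggregating
def shuffle_by_key(parts, n_out=2):
    out = [[] for _ in range(n_out)]
    for part in parts:
        for k, v in part:
            out[hash(k) % n_out].append((k, v))  # network transfer in real Spark
    return out

sums = {}
for part in shuffle_by_key(filtered):
    for k, v in part:
        sums[k] = sums.get(k, 0) + v
print(sums)  # per-key sums: a -> 3, b -> 2, c -> 4
```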

Q: What is Adaptive Query Execution (AQE)?

AQE optimizes query plans at runtime, not just at compile time. It re-optimizes the remaining plan after each stage finishes, using actual data statistics instead of estimates. Three key optimizations: (1) Coalescing post-shuffle partitions — merges small partitions after a shuffle to reduce task overhead. (2) Switching join strategies — converts a SortMergeJoin to a BroadcastHashJoin if the actual data size is small enough (even if the estimate was wrong). (3) Handling skewed joins — detects skewed partitions and splits them automatically. AQE is enabled by default in Databricks Runtime. It's one of the most impactful "free" performance improvements — many slow queries speed up just by having AQE on.
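The first optimization, partition coalescing, can be sketched in plain Python (a toy greedy version — AQE uses the actual shuffle statistics it observes at runtime):

```python
def coalesce_partitions(sizes_mb, target_mb=64):
    """Merge adjacent small shuffle partitions until each batch
    reaches the target size, reducing the number of tasks."""
    batches, current = [], []
    for size in sizes_mb:
        current.append(size)
        if sum(current) >= target_mb:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # leftover undersized batch
    return batches

# 8 tiny shuffle partitions -> 3 right-sized tasks
sizes = [5, 10, 60, 2, 3, 70, 8, 4]
print(coalesce_partitions(sizes))  # [[5, 10, 60], [2, 3, 70], [8, 4]]
```

Without coalescing, each tiny partition becomes its own task, and scheduling overhead dominates; with it, task count tracks data volume.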

Scenario Questions

Q: Design a data pipeline that ingests CSV files daily into a Delta Lake table.

Step-by-step: (1) Landing zone — CSVs are dropped into a cloud storage location (e.g., s3://raw/csv-uploads/). (2) Bronze ingestion — use Auto Loader (cloudFiles) to incrementally read new files. Auto Loader tracks which files have been processed so you never double-read:
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/checkpoints/schema/")
    .option("header", "true")
    .load("s3://raw/csv-uploads/"))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/bronze/")
    .trigger(availableNow=True)
    .toTable("bronze.raw_data"))
(3) Silver transformation — a scheduled Databricks job reads from bronze, cleans/validates, and writes to a silver Delta table using MERGE INTO for idempotency. (4) Gold aggregation — another job builds business-level aggregations. (5) Schedule the pipeline with Databricks Workflows, which orchestrates tasks with dependencies and retries. Add data quality checks using Delta Live Tables expectations or custom assertions.

Q: Your MERGE query is running slowly on a 1TB table. How do you optimize it?

Systematic optimization: (1) Partition the target table on a column used in the merge condition (e.g., date). Add a filter in the merge source so Spark only scans relevant partitions:
MERGE INTO target AS t
USING source AS s
ON t.id = s.id AND t.date = s.date  -- partition filter
...
(2) Z-ORDER the target table on the merge key (OPTIMIZE target ZORDER BY (id)) so data skipping is effective. (3) Reduce source size — select only the columns you actually need from the source. (4) Broadcast the source if it's small, via a hinted subquery: USING (SELECT /*+ BROADCAST(src) */ * FROM source src) AS s. (5) Run OPTIMIZE on the target table before the merge to compact small files. (6) Check for skew on the merge key in the Spark UI. (7) Verify low shuffle merge is active (spark.databricks.delta.merge.enableLowShuffle, enabled by default on recent runtimes). (8) Consider switching to Liquid Clustering instead of partitioning + Z-ORDER for easier maintenance.

Q: A team reports stale data in their Power BI dashboard connected to Databricks SQL. How do you debug?

Debug layer by layer: (1) Check the pipeline — did the upstream ETL job actually run? Check the Databricks Workflows run history for failures or delays. (2) Query the table directly in Databricks SQL — run SELECT MAX(updated_at) FROM the_table. If data is fresh here, the problem is downstream. (3) Check the SQL warehouse — Databricks SQL caches query results. Run REFRESH TABLE the_table, or bypass the result cache for the session, to rule out stale cached results. (4) Check Power BI — Power BI has its own caching layer. Check whether the dataset refresh schedule is running, or whether users are hitting an Import mode cache rather than DirectQuery. (5) Check table history: DESCRIBE HISTORY the_table shows when the last write occurred and what operation it was. (6) If using views, verify the view definition hasn't changed and intermediate tables are up to date.

Q: You need to implement SCD Type 2 in Databricks. Describe your approach.

SCD Type 2 preserves the full history of changes with effective_from, effective_to, and is_current columns. Two approaches in Databricks: Approach 1: MERGE with Delta Lake — Use a two-step merge. First, identify changed rows and expire them (set effective_to = current_date, is_current = false). Then insert new versions:
-- Step 1: Expire changed rows
MERGE INTO dim_customer AS t
USING staged_changes AS s
ON t.customer_id = s.customer_id AND t.is_current = true
WHEN MATCHED AND t.name != s.name THEN
  UPDATE SET t.effective_to = current_date(),
             t.is_current = false;

-- Step 2: Insert new versions — covers both rows expired in Step 1
-- and brand-new customers with no current row
INSERT INTO dim_customer
SELECT s.customer_id, s.name, current_date() AS effective_from,
       NULL AS effective_to, true AS is_current
FROM staged_changes s
LEFT ANTI JOIN dim_customer t
  ON t.customer_id = s.customer_id AND t.is_current = true;
Approach 2: Delta Live Tables (DLT) — Use APPLY CHANGES INTO with the SCD TYPE 2 keyword, which handles the logic automatically. This is the recommended approach for production because it handles edge cases (out-of-order events, duplicates) for you.

Q: A team member accidentally deleted rows from a production table. How do you recover?

Delta Lake's time travel makes this straightforward. First, check the table history to find the version before the delete:
-- See recent operations
DESCRIBE HISTORY production.sales.orders;

-- Verify the data in the previous version
SELECT COUNT(*) FROM production.sales.orders
VERSION AS OF 42;

-- Option 1: Restore the entire table to that version
RESTORE TABLE production.sales.orders TO VERSION AS OF 42;

-- Option 2: Re-insert just the deleted rows
INSERT INTO production.sales.orders
SELECT * FROM production.sales.orders VERSION AS OF 42
EXCEPT
SELECT * FROM production.sales.orders;
RESTORE is usually the simplest option. It's a metadata-only operation — it doesn't copy data files, just updates the transaction log to point to the old files. This works as long as you haven't run VACUUM since the delete. After recovery, audit the situation: consider adding row filters or restricting DELETE permissions via Unity Catalog to prevent recurrence.