Databricks Interview Questions
30+ Databricks interview questions organized by topic, with answers. Perfect for a 15-minute revision before an interview.
Lakehouse Architecture
Q: What is a lakehouse and how does it differ from a data lake or data warehouse?
Q: Explain the Medallion architecture (Bronze / Silver / Gold).
The Medallion architecture organizes a pipeline into three layers: Bronze holds raw data as ingested, Silver holds cleaned, validated, and deduplicated data, and Gold holds business-level aggregates ready for consumption (e.g., daily_revenue_by_region). Each layer is a Delta table, so you get ACID guarantees and time travel at every stage.
Q: What are the advantages of open file formats like Delta and Parquet?
Q: How does Databricks handle schema enforcement vs. schema evolution?
Schema enforcement is the default: a write whose schema doesn't match the table is rejected, protecting downstream consumers from silent corruption. Schema evolution is opt-in via .option("mergeSchema", "true") on a write or ALTER TABLE ... ADD COLUMNS. New columns are added automatically and backfilled with nulls for existing rows. You can also use .option("overwriteSchema", "true") for a complete schema replacement, but this is destructive.
Q: What is the difference between managed and external tables?
With a managed table, Databricks controls both the metadata and the data files: when you DROP a managed table, the underlying data files are deleted too. An external table points to data at a location you specify (e.g., s3://my-bucket/sales/). Dropping an external table removes the metadata from the catalog, but the data files remain untouched. Use managed tables when Databricks is the single owner of the data; use external tables when data is shared across multiple engines or you need to control the storage location.
Q: How does Databricks SQL differ from a Spark cluster for running queries?
Delta Lake
Q: What is Delta Lake and how does it differ from plain Parquet?
Delta Lake stores data as Parquet files plus a transaction log (_delta_log/) that tracks every change. This gives you features plain Parquet can't provide: atomic writes, consistent reads (snapshot isolation), time travel, schema enforcement, and the ability to do UPDATE, DELETE, and MERGE operations. Plain Parquet is just a file format — Delta Lake is a file format plus a protocol.
Q: Explain ACID transactions in the context of Delta Lake.
Atomicity: a write either commits fully to the transaction log or not at all, so readers never see partial results. Consistency: every reader sees a valid snapshot of the table. Isolation: concurrent writers are serialized through optimistic concurrency control on the log. Durability: once a commit lands in _delta_log, the data is persisted to cloud object storage and survives failures. This is a huge improvement over plain Parquet, where a failed Spark job could leave orphan files and corrupt readers.
Q: What is the transaction log (_delta_log) and how does it work?
The _delta_log/ directory sits alongside the Parquet data files. It contains ordered JSON files (000000.json, 000001.json, ...), each recording a commit: which files were added, removed, or modified, plus metadata changes. Every 10 commits, a checkpoint Parquet file is created that summarizes the current state, so readers don't need to replay all JSON files from the beginning. When a reader opens a Delta table, it reads the latest checkpoint plus any subsequent JSON commits to reconstruct the current list of valid data files. This log is the backbone of time travel, schema enforcement, and ACID semantics.
Q: Explain time travel in Delta Lake. How would you query a previous version?
-- By version number
SELECT * FROM my_table VERSION AS OF 5;
-- By timestamp
SELECT * FROM my_table TIMESTAMP AS OF '2026-03-01';
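Under the hood, both forms replay the transaction log up to a cutoff and rebuild the set of live data files for that point in time. A minimal sketch of that replay in plain Python, with made-up commit records (illustrative only, not the real Delta implementation):

```python
# Each commit records the data files it added and removed, plus a timestamp.
# These records are made up for illustration.
commits = [
    {"version": 0, "ts": "2026-01-01", "add": ["part-000.parquet"], "remove": []},
    {"version": 1, "ts": "2026-02-01", "add": ["part-001.parquet"], "remove": []},
    {"version": 2, "ts": "2026-03-05", "add": ["part-002.parquet"],
     "remove": ["part-000.parquet"]},
]

def snapshot(commits, as_of_version=None, as_of_ts=None):
    """Replay commits in order up to the cutoff; return the live file set."""
    live = set()
    for c in commits:
        if as_of_version is not None and c["version"] > as_of_version:
            break
        if as_of_ts is not None and c["ts"] > as_of_ts:  # ISO dates compare as strings
            break
        live |= set(c["add"])
        live -= set(c["remove"])
    return live

# VERSION AS OF 1 and TIMESTAMP AS OF '2026-03-01' resolve to the same snapshot:
assert snapshot(commits, as_of_version=1) == {"part-000.parquet", "part-001.parquet"}
assert snapshot(commits, as_of_ts="2026-03-01") == snapshot(commits, as_of_version=1)
```

This is also why VACUUM limits time travel: it deletes the files that only old snapshots reference.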
Use cases include auditing (what did this table look like last Tuesday?), debugging (compare current vs. previous version to find bad data), and disaster recovery (restore a table after an accidental DELETE with RESTORE TABLE my_table TO VERSION AS OF 5). Time travel is limited by the retention period, which defaults to 30 days.
Q: What does MERGE INTO do? Write a basic MERGE statement.
MERGE INTO performs upserts — insert new rows and update existing ones in a single atomic operation. It matches rows from a source against a target using a join condition, then defines what to do when rows match vs. don't match.
MERGE INTO target_table AS t
USING source_table AS s
ON t.id = s.id
WHEN MATCHED THEN
UPDATE SET t.name = s.name, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
INSERT (id, name, updated_at)
VALUES (s.id, s.name, s.updated_at);
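The matched/not-matched semantics can be sketched in plain Python with made-up rows (an illustration of the logic, not how Delta executes it):

```python
# Target keyed by id; MERGE updates matches and inserts the rest, and the
# result replaces the target in one step (mimicking atomicity with a copy).
target = {1: {"name": "Ana", "updated_at": "2026-01-01"},
          2: {"name": "Bo", "updated_at": "2026-01-01"}}
source = [{"id": 2, "name": "Bo Chen", "updated_at": "2026-02-01"},
          {"id": 3, "name": "Cy", "updated_at": "2026-02-01"}]

def merge(target, source):
    new = dict(target)  # readers keep seeing `target` until the swap
    for row in source:
        # WHEN MATCHED -> update; WHEN NOT MATCHED -> insert
        new[row["id"]] = {"name": row["name"], "updated_at": row["updated_at"]}
    return new

merged = merge(target, source)  # id 2 updated, id 3 inserted, id 1 untouched
```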
MERGE is the workhorse of CDC (Change Data Capture) pipelines. Without Delta Lake, you'd need to rewrite entire partitions or use complex deduplication logic.
Q: What is VACUUM and what happens if you set retention too low?
VACUUM deletes data files that are no longer referenced by the transaction log (i.e., files from old versions). By default it retains files for 7 days.
-- Remove files older than 7 days (default)
VACUUM my_table;
-- Remove files older than 24 hours (risky)
VACUUM my_table RETAIN 24 HOURS;
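The retention decision can be sketched in plain Python (made-up file records): a file is deleted only if no current version references it and it is older than the retention window.

```python
RETENTION_HOURS = 7 * 24  # default: 7 days

# Made-up file inventory: `referenced` means the current table version uses it.
files = [
    {"path": "part-000.parquet", "age_hours": 200, "referenced": False},
    {"path": "part-001.parquet", "age_hours": 30, "referenced": False},
    {"path": "part-002.parquet", "age_hours": 5, "referenced": True},
]

def vacuum(files, retention_hours=RETENTION_HOURS):
    """Return the paths VACUUM would delete."""
    return [f["path"] for f in files
            if not f["referenced"] and f["age_hours"] > retention_hours]

assert vacuum(files) == ["part-000.parquet"]  # default keeps the 30h-old file
# RETAIN 24 HOURS also deletes part-001: time travel to its version now fails
assert vacuum(files, retention_hours=24) == ["part-000.parquet", "part-001.parquet"]
```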
If you set retention too low, you break time travel — queries referencing old versions will fail with "file not found" errors. You also risk breaking long-running queries that started before the vacuum ran. Databricks prevents retention below 7 days unless you explicitly set spark.databricks.delta.retentionDurationCheck.enabled = false. Best practice: keep the default 7-day retention unless storage costs are a major concern.
Q: What is Z-ORDER and when would you use it?
OPTIMIZE my_table ZORDER BY (country, date);
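To see why this helps, here is an illustrative sketch of data skipping with made-up per-file min/max statistics (the kind the transaction log stores):

```python
# Each data file carries min/max stats for its columns; a filter can skip any
# file whose range cannot contain the predicate value. Stats are made up.
file_stats = [
    {"path": "f1.parquet", "min_date": "2026-01-01", "max_date": "2026-01-31"},
    {"path": "f2.parquet", "min_date": "2026-02-01", "max_date": "2026-02-28"},
    {"path": "f3.parquet", "min_date": "2026-01-15", "max_date": "2026-02-15"},
]

def files_to_scan(stats, wanted_date):
    """Keep only files whose [min, max] range could contain wanted_date."""
    return [f["path"] for f in stats
            if f["min_date"] <= wanted_date <= f["max_date"]]

assert files_to_scan(file_stats, "2026-02-20") == ["f2.parquet"]  # 2 files skipped
```

Z-ORDER earns its keep by making each file's range narrower and less overlapping, so more files fail the range test and get skipped.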
Delta rearranges data so that rows with the same country and date values are stored close together. Queries that filter on these columns can skip entire files via data skipping (min/max statistics in the transaction log). Use Z-ORDER on columns you frequently filter by — typically high-cardinality columns used in WHERE clauses. Don't Z-ORDER on too many columns (2-4 max) as effectiveness decreases. Z-ORDER is being superseded by Liquid Clustering in newer Databricks runtimes, which handles clustering automatically.
Spark SQL
Q: What's the difference between a temporary view and a global temporary view?
A temporary view (CREATE TEMP VIEW) is scoped to the current SparkSession (i.e., the current notebook or job). Other notebooks can't see it. A global temporary view (CREATE GLOBAL TEMP VIEW) is scoped to the Spark application and visible across all SparkSessions on the same cluster — accessed via the global_temp database (e.g., SELECT * FROM global_temp.my_view). Both disappear when the cluster restarts. In practice, most people use regular temp views; global temp views are rare and mostly useful for sharing data between notebooks in the same cluster session.
Q: Explain window functions. Write a query using ROW_NUMBER().
Window functions compute a value across a set of related rows without collapsing them the way GROUP BY does. You define a window with OVER(PARTITION BY ... ORDER BY ...).
SELECT
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) AS rn
FROM orders;
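The same windowing logic can be sketched in plain Python (illustrative, with made-up orders rather than PySpark):

```python
from itertools import groupby

orders = [
    {"customer_id": 1, "order_date": "2026-01-05", "amount": 10},
    {"customer_id": 1, "order_date": "2026-03-01", "amount": 25},
    {"customer_id": 2, "order_date": "2026-02-10", "amount": 40},
]

def with_row_number(rows):
    """PARTITION BY customer_id, ORDER BY order_date DESC, number rows from 1."""
    out = []
    for _, grp in groupby(sorted(rows, key=lambda r: r["customer_id"]),
                          key=lambda r: r["customer_id"]):
        ordered = sorted(grp, key=lambda r: r["order_date"], reverse=True)
        out.extend({**row, "rn": rn} for rn, row in enumerate(ordered, start=1))
    return out

latest = [r for r in with_row_number(orders) if r["rn"] == 1]
```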
The query assigns rn = 1 to the most recent order per customer. Common pattern: wrap it in a CTE and filter WHERE rn = 1 to get the latest row per group. Other useful window functions: RANK(), DENSE_RANK(), LAG(), LEAD(), and SUM() OVER (...) for running totals.
Q: What is the Catalyst optimizer?
Catalyst is Spark's query optimizer. It parses SQL or DataFrame code into a logical plan, applies rule-based optimizations such as predicate pushdown, column pruning, and constant folding, then selects a physical plan (e.g., join strategies) for execution. Because SQL and the DataFrame API share it, df.filter().select() produces the same plan as writing the equivalent SQL — both go through the same optimizer.
Q: What's the difference between Spark SQL and traditional SQL?
Spark SQL is a distributed query engine, which changes the mental model: (1) A JOIN may trigger a shuffle (data movement across nodes). (2) There's no traditional index — performance comes from partitioning, Z-ORDER, and data skipping. (3) Spark SQL supports semi-structured data natively (structs, arrays, maps) — you can do col.field or explode(array_col). (4) It processes data in batches, not row-by-row. (5) It integrates with Python/Scala DataFrames — you can mix SQL and code freely. The syntax is ANSI SQL-compatible, so most queries transfer directly, but tuning is fundamentally different.
Q: How do you handle semi-structured (JSON) data in Spark SQL?
-- Read JSON into a table
CREATE TABLE events USING json
OPTIONS (path 's3://bucket/events/');
-- Query nested fields
SELECT
payload.user.name,
payload.event_type,
explode(payload.items) AS item
FROM events
WHERE payload.event_type = 'purchase';
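The explode() behaviour can be sketched in plain Python with the standard json module (made-up event, not Spark's implementation):

```python
import json

# One raw event with a nested struct and an array, as in the query above.
event = json.loads('''
{"payload": {"user": {"name": "Ana"}, "event_type": "purchase",
             "items": [{"sku": "A1"}, {"sku": "B2"}]}}
''')

def explode(record, *path):
    """Yield one copy of the record per element of the array at the given path."""
    arr = record
    for key in path:
        arr = arr[key]
    for item in arr:
        yield {**record, "item": item}

rows = [r for r in explode(event, "payload", "items")
        if r["payload"]["event_type"] == "purchase"]
# Two rows, one per item, each still carrying the nested payload fields
```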
Key functions: from_json() to parse a JSON string column into a struct, to_json() for the reverse, explode() to flatten arrays, get_json_object() for quick extraction via JSONPath, and the : operator for variant types in Databricks (e.g., raw_col:key::string).
Q: What is the Photon engine?
Unity Catalog & Governance
Q: What is Unity Catalog and what problems does it solve?
Unity Catalog is Databricks' unified governance layer. It solves: (1) Centralized access control: one GRANT/REVOKE permission model across all workspaces. (2) A consistent namespace (catalog.schema.table) for organizing data. (3) Data lineage — automatic tracking of column-level lineage. (4) Audit logging — who accessed what, when. (5) Data sharing via Delta Sharing (open protocol, works across platforms). It governs tables, views, volumes (files), models, and functions.
Q: Explain the 3-level namespace in Unity Catalog.
Unity Catalog uses a three-level namespace: catalog.schema.object. A catalog is the top-level container (think of it as a database in traditional RDBMS terms) — often one per environment or business domain (e.g., production, development, finance). A schema (database) is a grouping within a catalog (e.g., production.sales). Objects are tables, views, volumes, functions, and models within a schema. Example: production.sales.orders. This replaces the old two-level database.table pattern and lets you organize and govern data hierarchically. Permissions cascade: granting USE CATALOG on production gives access to browse its schemas.
Q: How does data lineage work in Unity Catalog?
Q: What is row-level security and column masking in Databricks?
Row-level security restricts which rows each user can see. You define a boolean filter function and bind it to the table:
CREATE FUNCTION region_filter(region STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('us_team'), region = 'US', true);
ALTER TABLE sales SET ROW FILTER region_filter ON (region);
Column masking hides or redacts sensitive column values. You define a masking function and apply it to a column:
CREATE FUNCTION mask_email(email STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('pii_users'), email, '***@***.com');
ALTER TABLE customers ALTER COLUMN email SET MASK mask_email;
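The enforcement semantics can be sketched in plain Python (made-up users and group membership; the real check runs inside the query engine):

```python
GROUP_MEMBERS = {"pii_users": {"alice"}}  # made-up directory

def is_account_group_member(user, group):
    return user in GROUP_MEMBERS.get(group, set())

def mask_email(email, current_user):
    # Mirrors the SQL masking function: members see the value, others a redaction
    if is_account_group_member(current_user, "pii_users"):
        return email
    return "***@***.com"

assert mask_email("ana@example.com", "alice") == "ana@example.com"
assert mask_email("ana@example.com", "bob") == "***@***.com"
```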
Both are enforced at the query engine level — they apply regardless of whether the user accesses data via SQL, PySpark, or BI tools. This is a Unity Catalog feature.
Q: What is Delta Sharing and how does it work?
Clusters & Performance
Q: What types of clusters does Databricks offer?
Q: What is autoscaling and when should you use it?
Q: How do you troubleshoot a slow Spark job?
(1) Start with the Spark UI: identify which stage and which tasks dominate the runtime. (2) Look for skew: a handful of tasks taking far longer than the rest. (3) Check shuffle sizes: large shuffle reads/writes point at expensive wide operations. (4) Inspect the plan: run EXPLAIN EXTENDED and look for SortMergeJoin (expensive) vs. BroadcastHashJoin (fast). (5) Check for spill — "Spill to disk" in the Spark UI means executors ran out of memory. (6) Examine the data — small files, too many partitions, or uncompacted Delta tables all hurt performance. (7) Run OPTIMIZE and ZORDER on frequently queried Delta tables.
Q: What is data skew and how do you fix it?
Data skew means a few keys hold a disproportionate share of rows, so one or two tasks run far longer than the rest of the stage. Fixes: (1) Salting: append a random suffix to the hot key to spread its rows across partitions, then aggregate in two steps. (2) Broadcast join: if one side is small, broadcast it so the skewed side never shuffles. (3) Enable AQE: set spark.sql.adaptive.enabled = true (default in Databricks). AQE can detect skew at runtime and split skewed partitions automatically. (4) Repartition — explicitly repartition by a more evenly distributed column before the join. (5) Filter out the skewed value (often NULL) and process it separately.
Q: Explain the difference between narrow and wide transformations.
Narrow transformations (e.g., filter, select, map, WHERE) process data within a single partition — no data movement between nodes. They're cheap and pipeline together. Wide transformations (e.g., groupBy, JOIN, DISTINCT, ORDER BY, repartition) require data to be shuffled across the network to different partitions. Each wide transformation creates a stage boundary in the Spark execution plan. Shuffles write intermediate data to disk and transfer it over the network, making them the most expensive operation in Spark. Optimizing Spark jobs is largely about minimizing unnecessary shuffles.
Q: What is Adaptive Query Execution (AQE)?
Scenario Questions
Q: Design a data pipeline that ingests CSV files daily into a Delta Lake table.
(1) Landing zone: raw CSV files arrive in cloud storage (e.g., s3://raw/csv-uploads/). (2) Bronze ingestion — use Auto Loader (cloudFiles) to incrementally read new files. Auto Loader tracks which files have been processed so you never double-read:
df = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("cloudFiles.schemaLocation", "/checkpoints/schema/")
.option("header", "true")
.load("s3://raw/csv-uploads/"))
(df.writeStream
.format("delta")
.option("checkpointLocation", "/checkpoints/bronze/")
.trigger(availableNow=True)
.toTable("bronze.raw_data"))
(3) Silver transformation — a scheduled Databricks job reads from bronze, cleans/validates, and writes to a silver Delta table using MERGE INTO for idempotency. (4) Gold aggregation — another job builds business-level aggregations. (5) Schedule the pipeline with Databricks Workflows, which orchestrates tasks with dependencies and retries. Add data quality checks using Delta Live Tables expectations or custom assertions.
Q: Your MERGE query is running slowly on a 1TB table. How do you optimize it?
(1) Partition the target table on a column that appears in the merge condition (e.g., date). Add a filter in the merge source so Spark only scans relevant partitions:
MERGE INTO target AS t
USING source AS s
ON t.id = s.id AND t.date = s.date -- partition filter
...
(2) Z-ORDER the target table on the merge key (OPTIMIZE target ZORDER BY (id)) so data skipping is effective. (3) Reduce source size — only select the columns you actually need from the source. (4) Use a broadcast hint if the source is small: USING /*+ BROADCAST(s) */ source AS s. (5) Run OPTIMIZE on the target table before the merge to compact small files. (6) Check for skew on the merge key in the Spark UI. (7) Make sure low-shuffle merge is enabled; it is on by default in recent Databricks runtimes. (8) Consider switching to Liquid Clustering instead of partitioning + Z-ORDER for easier maintenance.
Q: A team reports stale data in their Power BI dashboard connected to Databricks SQL. How do you debug?
(1) Scope the problem: confirm which table is stale and what freshness is expected. (2) Check the source table directly: run SELECT MAX(updated_at) FROM the_table. If data is fresh here, the problem is downstream. (3) Check the SQL warehouse — Databricks SQL can cache query results. Run REFRESH TABLE the_table or restart the warehouse to clear caches. (4) Check Power BI — Power BI has its own caching layer. Check if the dataset refresh schedule is running, or if users are hitting an Import mode cache rather than DirectQuery. (5) Check table history: DESCRIBE HISTORY the_table shows when the last write occurred and what operation it was. (6) If using views, verify the view definition hasn't changed and intermediate tables are up to date.
Q: You need to implement SCD Type 2 in Databricks. Describe your approach.
SCD Type 2 preserves full history: instead of overwriting a changed dimension row, you expire the current version and insert a new one, tracking validity with effective_from, effective_to, and is_current columns. Two approaches in Databricks:
Approach 1: MERGE with Delta Lake — Use a two-step merge. First, identify changed rows and expire them (set effective_to = current_date, is_current = false). Then insert new versions:
-- Step 1: Expire changed rows
MERGE INTO dim_customer AS t
USING staged_changes AS s
ON t.customer_id = s.customer_id AND t.is_current = true
WHEN MATCHED AND t.name != s.name THEN
UPDATE SET t.effective_to = current_date(),
t.is_current = false;
-- Step 2: Insert new versions
INSERT INTO dim_customer
SELECT s.customer_id, s.name, current_date() AS effective_from,
NULL AS effective_to, true AS is_current
FROM staged_changes s
WHERE NOT EXISTS (SELECT 1 FROM dim_customer t
WHERE t.customer_id = s.customer_id
AND t.is_current = true);
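The two-step pattern in Approach 1 can be sketched in plain Python with made-up rows (illustrative logic only):

```python
TODAY = "2026-03-01"

dim = [{"customer_id": 1, "name": "Ana", "effective_from": "2025-01-01",
        "effective_to": None, "is_current": True}]
staged = [{"customer_id": 1, "name": "Ana Lee"},  # changed attributes
          {"customer_id": 2, "name": "Bo"}]       # brand-new customer

def apply_scd2(dim, staged):
    staged_by_id = {s["customer_id"]: s for s in staged}
    # Step 1: expire current rows whose attributes changed
    for row in dim:
        s = staged_by_id.get(row["customer_id"])
        if row["is_current"] and s and s["name"] != row["name"]:
            row["effective_to"], row["is_current"] = TODAY, False
    # Step 2: insert a new current version wherever no current row exists
    # (covers both changed customers, just expired, and brand-new ones)
    current_ids = {r["customer_id"] for r in dim if r["is_current"]}
    for s in staged:
        if s["customer_id"] not in current_ids:
            dim.append({**s, "effective_from": TODAY,
                        "effective_to": None, "is_current": True})
    return dim

dim = apply_scd2(dim, staged)  # one expired Ana row plus two current versions
```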
Approach 2: Delta Live Tables (DLT) — Use APPLY CHANGES INTO with the SCD TYPE 2 keyword, which handles the logic automatically. This is the recommended approach for production because it handles edge cases (out-of-order events, duplicates) for you.
Q: A team member accidentally deleted rows from a production table. How do you recover?
-- See recent operations
DESCRIBE HISTORY production.sales.orders;
-- Verify the data in the previous version
SELECT COUNT(*) FROM production.sales.orders
VERSION AS OF 42;
-- Option 1: Restore the entire table to that version
RESTORE TABLE production.sales.orders TO VERSION AS OF 42;
-- Option 2: Re-insert just the deleted rows
INSERT INTO production.sales.orders
SELECT * FROM production.sales.orders VERSION AS OF 42
EXCEPT
SELECT * FROM production.sales.orders;
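Option 2's set difference can be sketched in plain Python with made-up rows (ignoring EXCEPT's duplicate handling):

```python
# Rows as (order_id, amount) tuples; values are made up.
old_version = [("o1", 100), ("o2", 200), ("o3", 300)]  # VERSION AS OF 42
current = [("o1", 100), ("o3", 300)]                   # after the bad DELETE

# old_version EXCEPT current: exactly the rows the DELETE removed
deleted_rows = [r for r in old_version if r not in current]
restored = current + deleted_rows

assert deleted_rows == [("o2", 200)]
assert sorted(restored) == sorted(old_version)
```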
RESTORE is usually the simplest option. It's a metadata-only operation — it doesn't copy data files, just updates the transaction log to point to the old files. This works as long as you haven't run VACUUM since the delete. After recovery, audit the situation: consider adding row filters or restricting DELETE permissions via Unity Catalog to prevent recurrence.