Spark SQL on Databricks
Spark SQL lets you query Delta Lake tables using familiar SQL syntax — but at massive scale. You get the full dialect you already know (temp views, CTEs, window functions), and Databricks layers on serverless SQL warehouses and Photon engine acceleration.
Explain Like I'm 12
You know how SQL is the language you use to ask a database questions? Spark SQL is that same language — SELECT, WHERE, GROUP BY — all the stuff you already know. The difference is what's behind the curtain. With regular SQL, one computer does all the work. With Spark SQL, your query gets chopped up and hundreds of computers work on it simultaneously, like having an army of librarians searching a library instead of just one. Databricks makes this even easier by giving you a place to type SQL and automatically managing all those computers for you.
Spark SQL vs Traditional SQL
If you already know PostgreSQL or MySQL, you're 80% of the way there. Spark SQL follows the ANSI SQL standard, so most of your queries will work without changes. But there are important differences worth understanding before you start writing production queries.
| Feature | Traditional SQL (PostgreSQL/MySQL) | Spark SQL on Databricks |
|---|---|---|
| Execution engine | Single-node query planner | Distributed engine (Spark Catalyst optimizer + Photon) |
| Storage format | Row-based (heap pages) | Columnar (Delta Lake / Parquet) |
| Data size sweet spot | Megabytes to low terabytes | Gigabytes to petabytes |
| Indexes | B-tree, hash, GIN, GiST | No traditional indexes — uses partition pruning, Z-ORDER, data skipping |
| Transactions | Full ACID with row-level locking | ACID via Delta Lake (table-level, optimistic concurrency) |
| UPDATE/DELETE | Native, row-level | Supported on Delta tables (rewrites Parquet files under the hood) |
| Window functions | Full support | Full support (same syntax) |
| CTEs | Full support | Full support (same syntax) |
| Stored procedures | PL/pgSQL, PL/SQL | Not supported — use notebooks or workflows instead |
| Semi-structured data | JSONB (Postgres), JSON type (MySQL) | Native struct/array/map types + explode(), from_json() |
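The UPDATE/DELETE row deserves a quick illustration: on a Delta table the familiar row-level DML statements just work, even though the underlying Parquet files are immutable. A minimal sketch against the sales.transactions table created in the next section (the filter values are made up):
-- Delta rewrites the affected Parquet files and logs the change
UPDATE sales.transactions
SET amount = amount * 0.9
WHERE product_name = 'Widget Pro';
DELETE FROM sales.transactions
WHERE transaction_date < '2020-01-01';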
Creating and Managing Tables
Tables in Databricks come in two flavors: managed and external. This distinction matters a lot when you DROP them.
Managed Tables
Databricks controls both the metadata and the data files. When you drop a managed table, everything gets deleted — metadata and data files. This is the default behavior.
-- Create a managed Delta table (default format)
CREATE TABLE sales.transactions (
transaction_id BIGINT,
customer_id BIGINT,
product_name STRING,
amount DECIMAL(10,2),
transaction_date DATE
);
-- Create from a query result
CREATE TABLE sales.monthly_summary AS
SELECT
date_trunc('month', transaction_date) AS month,
COUNT(*) AS total_transactions,
SUM(amount) AS total_revenue
FROM sales.transactions
GROUP BY 1;
External Tables
You own the data files. Databricks only manages the metadata. When you drop an external table, the data files stay in your cloud storage — only the catalog entry disappears.
-- External table pointing to existing data in S3/ADLS/GCS
CREATE TABLE sales.raw_events
USING DELTA
LOCATION 's3://my-data-lake/raw/events/';
-- External table with explicit schema
CREATE TABLE sales.partner_feed (
partner_id INT,
event_type STRING,
payload STRING,
received_at TIMESTAMP
)
USING DELTA
LOCATION 'abfss://[email protected]/partner/';
DROP TABLE on a managed table deletes your data permanently. If you're working with production data, prefer external tables and treat every DROP TABLE with care. In Unity Catalog, run DESCRIBE EXTENDED table_name and check the Type row to see whether a table is managed or external before dropping it.
Useful Table Commands
-- See table metadata (managed vs external, location, schema)
DESCRIBE EXTENDED sales.transactions;
-- Add columns to existing table
ALTER TABLE sales.transactions ADD COLUMNS (
discount_pct DOUBLE,
region STRING
);
-- View table history (Delta Lake time travel)
DESCRIBE HISTORY sales.transactions;
-- Query a past version of the table
SELECT * FROM sales.transactions VERSION AS OF 5;
SELECT * FROM sales.transactions TIMESTAMP AS OF '2026-03-01';
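Time travel queries are read-only; if an older version turns out to be the one you want, Delta can also roll the table back in place. A small sketch (the version number is illustrative):
-- Restore the table to an earlier version listed in DESCRIBE HISTORY
RESTORE TABLE sales.transactions TO VERSION AS OF 5;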
Temporary Views
Temp views are named query results that exist only during your session. They don't create any files on disk — think of them as saved queries you can reference by name.
Session-Scoped Temp Views
Visible only in the notebook or SQL session that created them. As soon as you detach from the cluster or close the session, they vanish.
-- Create a temp view
CREATE OR REPLACE TEMP VIEW active_customers AS
SELECT customer_id, customer_name, last_order_date
FROM sales.customers
WHERE last_order_date >= date_sub(current_date(), 90);
-- Use it like a table
SELECT * FROM active_customers WHERE customer_name LIKE 'A%';
Global Temp Views
Shared across all notebooks attached to the same cluster. They live in a special global_temp schema and disappear when the cluster is restarted.
-- Create a global temp view
CREATE OR REPLACE GLOBAL TEMP VIEW daily_metrics AS
SELECT
transaction_date,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(amount) AS revenue
FROM sales.transactions
GROUP BY transaction_date;
-- Query it (note the global_temp prefix)
SELECT * FROM global_temp.daily_metrics
WHERE transaction_date = current_date();
Use a TEMP VIEW for intermediate calculations within a single notebook. Use a GLOBAL TEMP VIEW when multiple notebooks on the same cluster need to share a pre-computed result (e.g., a lookup table). If multiple teams or clusters need access, create a proper table or a persistent view instead.
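When the result needs to outlive the cluster, the persistent option is the ordinary CREATE VIEW you already know from other databases. A minimal sketch mirroring the temp view above (the view name is illustrative):
-- A persistent view: registered in the catalog, visible to any cluster
-- or SQL warehouse with the right permissions, survives restarts
CREATE OR REPLACE VIEW sales.active_customers_v AS
SELECT customer_id, customer_name, last_order_date
FROM sales.customers
WHERE last_order_date >= date_sub(current_date(), 90);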
CTEs and Subqueries
Common Table Expressions (CTEs) are one of the best things to happen to SQL readability. They let you name your intermediate results and build complex queries step by step, like building with blocks.
Basic CTE
WITH monthly_revenue AS (
SELECT
date_trunc('month', transaction_date) AS month,
SUM(amount) AS revenue
FROM sales.transactions
GROUP BY 1
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month))
/ LAG(revenue) OVER (ORDER BY month) * 100, 2
) AS growth_pct
FROM monthly_revenue
ORDER BY month;
Chained CTEs
You can stack CTEs — each one can reference the previous ones. This keeps your logic flat and readable instead of deeply nested.
WITH raw_orders AS (
SELECT * FROM sales.transactions
WHERE transaction_date >= '2026-01-01'
),
customer_totals AS (
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM raw_orders
GROUP BY customer_id
),
customer_segments AS (
SELECT *,
CASE
WHEN total_spent >= 10000 THEN 'VIP'
WHEN total_spent >= 1000 THEN 'Regular'
ELSE 'Occasional'
END AS segment
FROM customer_totals
)
SELECT segment, COUNT(*) AS customers, AVG(total_spent) AS avg_spent
FROM customer_segments
GROUP BY segment
ORDER BY avg_spent DESC;
Correlated Subqueries
These reference columns from the outer query. They work in Spark SQL but can be slow because the subquery may execute once per outer row. Prefer JOINs or window functions when possible.
-- Find each customer's most recent transaction
-- (correlated subquery approach)
SELECT t.*
FROM sales.transactions t
WHERE t.transaction_date = (
SELECT MAX(t2.transaction_date)
FROM sales.transactions t2
WHERE t2.customer_id = t.customer_id
);
-- Better alternative using a window function
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id ORDER BY transaction_date DESC
) AS rn
FROM sales.transactions
)
SELECT * FROM ranked WHERE rn = 1;
Window Functions
Window functions are where SQL gets its superpowers. They let you perform calculations across rows without collapsing them — unlike GROUP BY, you keep every row and add computed columns.
The Syntax Pattern
function_name(column) OVER (
PARTITION BY partition_column -- like GROUP BY, but keeps rows
ORDER BY sort_column -- ordering within each partition
ROWS BETWEEN start AND end -- optional frame specification
)
Ranking Functions
SELECT
customer_id,
product_name,
amount,
-- Different ranking behaviors for ties
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS row_num,
RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS dense_rank
FROM sales.transactions;
ROW_NUMBER always gives unique numbers (1, 2, 3). RANK skips numbers after ties (1, 1, 3). DENSE_RANK doesn't skip (1, 1, 2). Which one you pick depends on whether you care about gaps.
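If you want to see the three behaviors side by side without touching a real table, inline VALUES works in Spark SQL; the amounts below are made up:
-- Three rows, two of them tied on amount
SELECT
amount,
ROW_NUMBER() OVER (ORDER BY amount DESC) AS row_num, -- 1, 2, 3
RANK() OVER (ORDER BY amount DESC) AS rank, -- 1, 1, 3
DENSE_RANK() OVER (ORDER BY amount DESC) AS dense_rank -- 1, 1, 2
FROM VALUES (100), (100), (50) AS t(amount);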
LAG and LEAD — Look Backward and Forward
-- Month-over-month revenue growth
WITH monthly AS (
SELECT
date_trunc('month', transaction_date) AS month,
SUM(amount) AS revenue
FROM sales.transactions
GROUP BY 1
)
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
LEAD(revenue, 1) OVER (ORDER BY month) AS next_month,
ROUND(
(revenue - LAG(revenue, 1) OVER (ORDER BY month))
/ LAG(revenue, 1) OVER (ORDER BY month) * 100, 2
) AS mom_growth_pct
FROM monthly
ORDER BY month;
Running Totals and Moving Averages
SELECT
transaction_date,
amount,
-- Running total (all rows from start up to current)
SUM(amount) OVER (
ORDER BY transaction_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total,
-- 7-day moving average
AVG(amount) OVER (
ORDER BY transaction_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM sales.daily_revenue
ORDER BY transaction_date;
Real-World Example: Top 3 Products per Region
WITH product_sales AS (
SELECT
region,
product_name,
SUM(amount) AS total_sales
FROM sales.transactions
GROUP BY region, product_name
),
ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY region ORDER BY total_sales DESC
) AS rank
FROM product_sales
)
SELECT region, product_name, total_sales, rank
FROM ranked
WHERE rank <= 3
ORDER BY region, rank;
If several columns share the same window specification, name it once with a WINDOW clause instead of repeating the OVER clause multiple times. It keeps your query DRY:
SELECT
customer_id,
amount,
ROW_NUMBER() OVER w AS row_num,
SUM(amount) OVER w AS running_total
FROM sales.transactions
WINDOW w AS (PARTITION BY customer_id ORDER BY transaction_date);
Working with Semi-Structured Data
Real-world data is messy. APIs return nested JSON, event streams have arrays inside arrays, and configuration data has arbitrary key-value pairs. Spark SQL has first-class tools for all of this.
Querying Nested Fields
-- If you have a struct column, access nested fields with dot notation
SELECT
event_id,
user.name AS user_name,
user.email AS user_email,
user.address.city AS city
FROM events.raw_stream;
explode() — Flatten Arrays into Rows
-- Given a column "items" that is an ARRAY of STRUCTs
SELECT
order_id,
item.product_name,
item.quantity,
item.price
FROM sales.orders
LATERAL VIEW explode(items) AS item;
-- Or call explode() directly in the SELECT list
SELECT
order_id,
explode(items) AS item
FROM sales.orders;
from_json() — Parse JSON Strings
-- If your JSON is stored as a plain STRING column
-- First, infer the schema from a sample
SELECT schema_of_json('{"name":"Alice","scores":[90,85,92]}');
-- Returns: STRUCT<name: STRING, scores: ARRAY<BIGINT>>
-- Then parse it
SELECT
event_id,
parsed.name,
parsed.scores
FROM (
SELECT
event_id,
from_json(
payload,
'STRUCT<name: STRING, scores: ARRAY<BIGINT>>'
) AS parsed
FROM events.raw_json
);
Useful Functions for Semi-Structured Data
| Function | What It Does | Example |
|---|---|---|
| explode(array) | One row per array element | explode(items) |
| posexplode(array) | Same, but with index position | posexplode(tags) |
| explode(map) | One row per key-value pair | explode(properties) |
| from_json(str, schema) | Parse JSON string into struct | from_json(payload, schema) |
| to_json(struct) | Convert struct back to JSON string | to_json(user) |
| schema_of_json(str) | Infer schema from a JSON sample | schema_of_json('{"a":1}') |
| get_json_object(str, path) | Extract value using JSONPath | get_json_object(data, '$.name') |
| collect_list(col) | Aggregate rows back into an array | collect_list(product_name) |
Define explicit schemas for from_json() and your Delta tables instead of relying on inference, and opt in to schema changes deliberately (MERGE ... SET * WITH SCHEMA EVOLUTION). This catches data quality issues early instead of at query time.
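Two of the functions above in action. A small sketch assuming the events.raw_json table from the from_json() example, with a top-level name key in the payload string:
-- Pull one field out of the JSON string, then gather the matching
-- event_ids back into an array per name
SELECT
get_json_object(payload, '$.name') AS name,
collect_list(event_id) AS event_ids
FROM events.raw_json
GROUP BY get_json_object(payload, '$.name');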
Performance Tuning
Writing correct SQL is step one. Writing fast SQL on a distributed system takes a few extra tricks. Here's what actually moves the needle on Databricks.
Photon Engine
Photon is Databricks' native C++ execution engine that replaces parts of the JVM-based Spark executor. It's particularly fast for:
- Scans and filters on Delta/Parquet data
- Aggregations (GROUP BY, SUM, COUNT)
- Joins (especially hash joins)
- String and date operations
You don't need to change your SQL to use Photon — just select a Photon-enabled cluster or SQL warehouse. The optimizer automatically routes eligible operations to the Photon engine.
Adaptive Query Execution (AQE)
AQE is Spark's ability to change the query plan at runtime based on actual data statistics. It's enabled by default on Databricks and handles three big problems automatically:
- Coalescing shuffle partitions — merges too-small partitions after a shuffle to avoid the overhead of tiny tasks
- Converting sort-merge joins to broadcast joins — if one side of a join turns out to be small after filtering, AQE switches to the faster broadcast strategy
- Handling skewed joins — splits oversized partitions so no single task takes forever
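A minimal sketch of the knobs involved: the SET statements show the Spark 3.x AQE flags (already on by default on Databricks, listed only so you know where to look), and the hint forces the broadcast behavior from the second bullet manually. The dim_products join columns are assumed for illustration.
-- AQE switches (defaults shown; you normally leave these alone)
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.coalescePartitions.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
-- Force a broadcast join yourself when you know one side is small
SELECT /*+ BROADCAST(p) */ t.*, p.category
FROM sales.transactions t
JOIN sales.dim_products p ON t.product_name = p.product_name;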
Partition Pruning
If your table is partitioned by date, always include a WHERE filter on that column. Spark skips entire directories of data it doesn't need to read.
-- Partition your table by date
CREATE TABLE sales.events (
event_id BIGINT,
event_type STRING,
payload STRING,
event_date DATE
)
USING DELTA
PARTITIONED BY (event_date);
-- This query only reads 1 day's partition instead of all data
SELECT * FROM sales.events
WHERE event_date = '2026-03-15';
Z-ORDER for Multi-Column Lookups
-- Co-locate data by frequently filtered columns
OPTIMIZE sales.transactions
ZORDER BY (customer_id, product_name);
-- Now this query benefits from data skipping
SELECT * FROM sales.transactions
WHERE customer_id = 12345 AND product_name = 'Widget Pro';
Caching
-- Cache a table in memory for repeated access
CACHE TABLE sales.dim_products;
-- Uncache when you're done
UNCACHE TABLE sales.dim_products;
-- To verify what's cached, check the Storage tab in the Spark UI
-- (there is no SQL command that lists cache contents)
Use EXPLAIN to Diagnose
-- See what Spark will actually do
EXPLAIN EXTENDED
SELECT customer_id, SUM(amount)
FROM sales.transactions
WHERE transaction_date >= '2026-01-01'
GROUP BY customer_id;
-- Look for these in the output:
-- PartitionFilters: good (partition pruning working)
-- BroadcastHashJoin: good for small tables
-- SortMergeJoin: expect for large-large joins
-- Exchange: shuffle happening (expensive but sometimes necessary)
Small files pile up over time and slow scans down. Run OPTIMIZE sales.my_table regularly to compact files, or enable auto-compaction: ALTER TABLE sales.my_table SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true', 'delta.autoOptimize.autoCompact' = 'true').
When a query is slow, start by running EXPLAIN EXTENDED on it. Nine times out of ten, the plan reveals the bottleneck: a missing partition filter, a sort-merge join that could be a broadcast join, or a skewed shuffle. Don't guess — read the plan.
Databricks SQL vs Notebook SQL
You can run SQL in two places on Databricks: the SQL Editor (powered by a SQL warehouse) and a notebook cell (powered by an all-purpose or jobs cluster). They use the same Spark SQL dialect, but the use cases and economics are different.
| Aspect | Databricks SQL (SQL Warehouse) | Notebook SQL (All-Purpose Cluster) |
|---|---|---|
| Target user | Analysts, BI consumers | Engineers, data scientists |
| Compute | Serverless or classic SQL warehouse | All-purpose or jobs cluster |
| Startup time | Serverless: seconds; Classic: minutes | Minutes (cluster has to start) |
| Photon | Always on | Only on Photon-enabled clusters |
| Languages | SQL only | SQL, Python, Scala, R in same notebook |
| Dashboards | Built-in dashboard builder | No built-in dashboards |
| BI tool integration | Native JDBC/ODBC endpoint for Tableau, Power BI | Possible but not typical |
| Cost model | Per-query or per-hour (warehouse sizing) | Per-hour (cluster runs until terminated) |
| Alerts | Built-in SQL alerts (email/Slack when a condition is met) | No built-in alerting |
Rule of thumb: If you're writing ad-hoc queries, building dashboards, or connecting BI tools — use a SQL warehouse. If you're building an ETL pipeline that mixes SQL with Python transformations, or doing data science — use a notebook on an all-purpose cluster.
Test Yourself
Q: What is the difference between a TEMP VIEW and a GLOBAL TEMP VIEW?
A TEMP VIEW is visible only in the session or notebook that created it and disappears when that session ends. A GLOBAL TEMP VIEW is shared by every notebook attached to the same cluster and must be queried through the global_temp schema (e.g., SELECT * FROM global_temp.my_view). Both disappear when the cluster restarts.
Q: Write a window function that assigns a rank to each customer's orders by amount (highest first), resetting the rank for each customer.
RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC). Use ROW_NUMBER() instead if you need unique numbers even when amounts are tied. Use DENSE_RANK() if you want no gaps in the ranking sequence after ties.
Q: What happens when you DROP a managed table vs an external table?
Dropping a managed table deletes both the catalog metadata and the underlying data files. Dropping an external table removes only the catalog entry; the data files stay in your cloud storage. Check which kind a table is with DESCRIBE EXTENDED table_name.
Q: How does the Photon engine improve query performance, and do you need to change your SQL to use it?
Photon is Databricks' native C++ execution engine. It accelerates scans, filters, aggregations, joins, and string/date operations on Delta and Parquet data. You don't need to change your SQL at all; run on a Photon-enabled cluster or SQL warehouse and eligible operations are routed to Photon automatically.
Q: Why are CTEs preferred over deeply nested subqueries in Spark SQL?
CTEs (WITH clauses) make complex queries more readable and maintainable by naming intermediate results. Instead of reading inside-out, you read top-to-bottom. Performance is generally the same since Spark's Catalyst optimizer inlines CTEs. However, readability matters for debugging, code review, and collaboration — especially in production pipelines.
Interview Questions
Q: Explain how Spark SQL processes a query from the moment you hit "Run" to the final result. What is the role of the Catalyst optimizer?
Spark parses the SQL into an unresolved logical plan, resolves table and column references against the catalog, and hands the plan to the Catalyst optimizer, which applies rule-based rewrites such as predicate pushdown, constant folding, and projection pruning. Catalyst then generates candidate physical plans, picks one by cost, and the chosen plan runs as parallel tasks across the cluster. On Databricks, Adaptive Query Execution can re-optimize the plan at runtime and Photon executes eligible operators.
Q: You have a Spark SQL query that joins a 500 GB fact table with a 50 MB dimension table. How would you optimize this join?
Broadcast the small table, either with a hint (SELECT /*+ BROADCAST(dim) */ ... FROM fact JOIN dim ON ...) or by setting spark.sql.autoBroadcastJoinThreshold to at least 50 MB. The small table gets copied to every executor, eliminating the expensive shuffle of the large table. With Adaptive Query Execution enabled on Databricks, Spark may do this automatically if it detects the dimension table is small after runtime filtering.
Q: What is the difference between partition pruning and data skipping in Delta Lake, and when does each apply?
Partition pruning skips entire partition directories when the partition column appears in the WHERE clause (e.g., filtering on event_date when the table is PARTITIONED BY (event_date)). Data skipping uses min/max statistics stored in the Delta transaction log to skip individual Parquet files even within a partition. Z-ORDER enhances data skipping by co-locating related values in the same files. Partition pruning applies before any file is read; data skipping applies when Spark checks file-level metadata before opening each file.
Q: A colleague writes a query with a correlated subquery that runs for 20 minutes. How would you diagnose and fix it?
Start with EXPLAIN EXTENDED to see the physical plan. Correlated subqueries sometimes get rewritten as joins by the Catalyst optimizer, but when they don't, each outer row triggers a separate subquery execution. The fix: rewrite the correlated subquery as either (1) a JOIN with a pre-aggregated subquery, or (2) a window function. For example, WHERE amount = (SELECT MAX(amount) ... WHERE t2.id = t1.id) can be rewritten as ROW_NUMBER() OVER (PARTITION BY id ORDER BY amount DESC) = 1. Also check if the table is partitioned and whether the query triggers a full table scan.
Q: When would you choose a SQL warehouse over an all-purpose cluster for running SQL workloads on Databricks?
Choose a SQL warehouse for ad-hoc analysis, dashboards, alerts, and BI tool connections: it is SQL-only, Photon is always on, and serverless warehouses start in seconds. Choose an all-purpose cluster when the workload mixes SQL with Python, Scala, or R in the same notebook, such as ETL pipelines or data science work.