Spark SQL on Databricks

TL;DR

Spark SQL lets you query Delta Lake tables using familiar SQL syntax — but at massive scale. Databricks layers on serverless SQL warehouses and Photon engine acceleration, alongside standard Spark SQL features you may already know: temp views, CTEs, and window functions.

Explain Like I'm 12

You know how SQL is the language you use to ask a database questions? Spark SQL is that same language — SELECT, WHERE, GROUP BY — all the stuff you already know. The difference is what's behind the curtain. With regular SQL, one computer does all the work. With Spark SQL, your query gets chopped up and hundreds of computers work on it simultaneously, like having an army of librarians searching a library instead of just one. Databricks makes this even easier by giving you a place to type SQL and automatically managing all those computers for you.

Spark SQL vs Traditional SQL

If you already know PostgreSQL or MySQL, you're 80% of the way there. Spark SQL follows the ANSI SQL standard, so most of your queries will work without changes. But there are important differences worth understanding before you start writing production queries.

| Feature | Traditional SQL (PostgreSQL/MySQL) | Spark SQL on Databricks |
|---|---|---|
| Execution engine | Single-node query planner | Distributed engine (Spark Catalyst optimizer + Photon) |
| Storage format | Row-based (heap pages) | Columnar (Delta Lake / Parquet) |
| Data size sweet spot | Megabytes to low terabytes | Gigabytes to petabytes |
| Indexes | B-tree, hash, GIN, GiST | No traditional indexes — uses partition pruning, Z-ORDER, data skipping |
| Transactions | Full ACID with row-level locking | ACID via Delta Lake (table-level, optimistic concurrency) |
| UPDATE/DELETE | Native, row-level | Supported on Delta tables (rewrites Parquet files under the hood) |
| Window functions | Full support | Full support (same syntax) |
| CTEs | Full support | Full support (same syntax) |
| Stored procedures | PL/pgSQL, PL/SQL | Not supported — use notebooks or workflows instead |
| Semi-structured data | JSONB (Postgres), JSON type (MySQL) | Native struct/array/map types + explode(), from_json() |

Key insight: Spark SQL doesn't have traditional indexes. Instead of creating an index on a column, you partition your data by frequently filtered columns and use Z-ORDER for multi-column lookups. Think of it as organizing a filing cabinet by folder (partition) and then sorting papers within each folder (Z-ORDER).

Creating and Managing Tables

Tables in Databricks come in two flavors: managed and external. This distinction matters a lot when you DROP them.

Managed Tables

Databricks controls both the metadata and the data files. When you drop a managed table, everything gets deleted — metadata and data files. This is the default behavior.

-- Create a managed Delta table (default format)
CREATE TABLE sales.transactions (
  transaction_id   BIGINT,
  customer_id      BIGINT,
  product_name     STRING,
  amount           DECIMAL(10,2),
  transaction_date DATE
);

-- Create from a query result
CREATE TABLE sales.monthly_summary AS
SELECT
  date_trunc('month', transaction_date) AS month,
  COUNT(*)                              AS total_transactions,
  SUM(amount)                           AS total_revenue
FROM sales.transactions
GROUP BY 1;

External Tables

You own the data files. Databricks only manages the metadata. When you drop an external table, the data files stay in your cloud storage — only the catalog entry disappears.

-- External table pointing to existing data in S3/ADLS/GCS
CREATE TABLE sales.raw_events
USING DELTA
LOCATION 's3://my-data-lake/raw/events/';

-- External table with explicit schema
CREATE TABLE sales.partner_feed (
  partner_id   INT,
  event_type   STRING,
  payload      STRING,
  received_at  TIMESTAMP
)
USING DELTA
LOCATION 'abfss://[email protected]/partner/';
Watch out: DROP TABLE on a managed table deletes your data permanently. If you're working with production data, prefer external tables or use DROP TABLE IF EXISTS carefully. Before dropping anything, run DESCRIBE EXTENDED table_name and check the Type field to see whether the table is managed or external.

Useful Table Commands

-- See table metadata (managed vs external, location, schema)
DESCRIBE EXTENDED sales.transactions;

-- Add columns to existing table
ALTER TABLE sales.transactions ADD COLUMNS (
  discount_pct DOUBLE,
  region       STRING
);

-- View table history (Delta Lake time travel)
DESCRIBE HISTORY sales.transactions;

-- Query a past version of the table
SELECT * FROM sales.transactions VERSION AS OF 5;
SELECT * FROM sales.transactions TIMESTAMP AS OF '2026-03-01';

Temporary Views

Temp views are named query results that exist only during your session. They don't create any files on disk — think of them as saved queries you can reference by name.

Session-Scoped Temp Views

Visible only in the notebook or SQL session that created them. As soon as you detach from the cluster or close the session, they vanish.

-- Create a temp view
CREATE OR REPLACE TEMP VIEW active_customers AS
SELECT customer_id, customer_name, last_order_date
FROM sales.customers
WHERE last_order_date >= date_sub(current_date(), 90);

-- Use it like a table
SELECT * FROM active_customers WHERE customer_name LIKE 'A%';

Global Temp Views

Shared across all notebooks attached to the same cluster. They live in a special global_temp schema and disappear when the cluster is restarted.

-- Create a global temp view
CREATE OR REPLACE GLOBAL TEMP VIEW daily_metrics AS
SELECT
  transaction_date,
  COUNT(DISTINCT customer_id) AS unique_customers,
  SUM(amount)                 AS revenue
FROM sales.transactions
GROUP BY transaction_date;

-- Query it (note the global_temp prefix)
SELECT * FROM global_temp.daily_metrics
WHERE transaction_date = current_date();
When to use which: Use a regular TEMP VIEW for intermediate calculations within a single notebook. Use a GLOBAL TEMP VIEW when multiple notebooks on the same cluster need to share a pre-computed result (e.g., a lookup table). If multiple teams or clusters need access, create a proper table or a persistent view instead.

CTEs and Subqueries

Common Table Expressions (CTEs) are one of the best things to happen to SQL readability. They let you name your intermediate results and build complex queries step by step, like building with blocks.

Basic CTE

WITH monthly_revenue AS (
  SELECT
    date_trunc('month', transaction_date) AS month,
    SUM(amount)                           AS revenue
  FROM sales.transactions
  GROUP BY 1
)
SELECT
  month,
  revenue,
  LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
  ROUND(
    (revenue - LAG(revenue) OVER (ORDER BY month))
    / LAG(revenue) OVER (ORDER BY month) * 100, 2
  ) AS growth_pct
FROM monthly_revenue
ORDER BY month;

Chained CTEs

You can stack CTEs — each one can reference the previous ones. This keeps your logic flat and readable instead of deeply nested.

WITH raw_orders AS (
  SELECT * FROM sales.transactions
  WHERE transaction_date >= '2026-01-01'
),
customer_totals AS (
  SELECT
    customer_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_spent
  FROM raw_orders
  GROUP BY customer_id
),
customer_segments AS (
  SELECT *,
    CASE
      WHEN total_spent >= 10000 THEN 'VIP'
      WHEN total_spent >= 1000  THEN 'Regular'
      ELSE 'Occasional'
    END AS segment
  FROM customer_totals
)
SELECT segment, COUNT(*) AS customers, AVG(total_spent) AS avg_spent
FROM customer_segments
GROUP BY segment
ORDER BY avg_spent DESC;

Correlated Subqueries

These reference columns from the outer query. They work in Spark SQL but can be slow because the subquery may execute once per outer row. Prefer JOINs or window functions when possible.

-- Find each customer's most recent transaction
-- (correlated subquery approach)
SELECT t.*
FROM sales.transactions t
WHERE t.transaction_date = (
  SELECT MAX(t2.transaction_date)
  FROM sales.transactions t2
  WHERE t2.customer_id = t.customer_id
);

-- Better alternative using a window function
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id ORDER BY transaction_date DESC
    ) AS rn
  FROM sales.transactions
)
SELECT * FROM ranked WHERE rn = 1;
Performance note: Spark's Catalyst optimizer can sometimes rewrite correlated subqueries as joins internally, but it doesn't always succeed. If your correlated subquery is slow, rewrite it yourself as a JOIN or window function. The optimizer will thank you — and so will your cluster.

Window Functions

Window functions are where SQL gets its superpowers. They let you perform calculations across rows without collapsing them — unlike GROUP BY, you keep every row and add computed columns.

The Syntax Pattern

function_name(column) OVER (
  PARTITION BY partition_column   -- like GROUP BY, but keeps rows
  ORDER BY sort_column            -- ordering within each partition
  ROWS BETWEEN start AND end     -- optional frame specification
)

Ranking Functions

SELECT
  customer_id,
  product_name,
  amount,
  -- Different ranking behaviors for ties
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS row_num,
  RANK()       OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rank,
  DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS dense_rank
FROM sales.transactions;

ROW_NUMBER always gives unique numbers (1, 2, 3). RANK skips numbers after ties (1, 1, 3). DENSE_RANK doesn't skip (1, 1, 2). Which one you pick depends on whether you care about gaps.
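To make the tie behavior concrete, here is a small plain-Python sketch (no Spark required, illustrative only) that reproduces all three numbering schemes over a list of amounts:

```python
def rankings(amounts):
    """Reproduce ROW_NUMBER / RANK / DENSE_RANK tie semantics over a
    list of amounts, ranked highest-first (plain Python, no Spark)."""
    ordered = sorted(amounts, reverse=True)
    result = []
    rank = dense = 0
    for i, amt in enumerate(ordered):
        row_number = i + 1                      # always unique: 1, 2, 3, ...
        if i == 0 or amt != ordered[i - 1]:
            rank = i + 1                        # RANK skips numbers after ties
            dense += 1                          # DENSE_RANK never skips
        result.append((amt, row_number, rank, dense))
    return result

# Two rows tied at 100:
print(rankings([100, 90, 100]))
# [(100, 1, 1, 1), (100, 2, 1, 1), (90, 3, 3, 2)]
```

Note how the tied rows share RANK and DENSE_RANK but still get distinct ROW_NUMBER values.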

LAG and LEAD — Look Backward and Forward

-- Month-over-month revenue growth
WITH monthly AS (
  SELECT
    date_trunc('month', transaction_date) AS month,
    SUM(amount) AS revenue
  FROM sales.transactions
  GROUP BY 1
)
SELECT
  month,
  revenue,
  LAG(revenue, 1)  OVER (ORDER BY month) AS prev_month,
  LEAD(revenue, 1) OVER (ORDER BY month) AS next_month,
  ROUND(
    (revenue - LAG(revenue, 1) OVER (ORDER BY month))
    / LAG(revenue, 1) OVER (ORDER BY month) * 100, 2
  ) AS mom_growth_pct
FROM monthly
ORDER BY month;
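For intuition, the LAG-based growth calculation can be mirrored in a few lines of plain Python (a sketch of the semantics, not how Spark executes it):

```python
def mom_growth(monthly_revenue):
    """Mirror LAG(revenue, 1) OVER (ORDER BY month): each row looks one
    position back to compute month-over-month growth percent."""
    rows = []
    for i, rev in enumerate(monthly_revenue):
        prev = monthly_revenue[i - 1] if i > 0 else None  # LAG -> None (NULL) on first row
        growth = round((rev - prev) / prev * 100, 2) if prev is not None else None
        rows.append((rev, prev, growth))
    return rows

print(mom_growth([1000, 1100, 990]))
# [(1000, None, None), (1100, 1000, 10.0), (990, 1100, -10.0)]
```

Just like LAG in SQL, the first row has no predecessor and yields NULL (None here).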

Running Totals and Moving Averages

SELECT
  transaction_date,
  amount,
  -- Running total (all rows from start up to current)
  SUM(amount) OVER (
    ORDER BY transaction_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total,
  -- 7-day moving average
  AVG(amount) OVER (
    ORDER BY transaction_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7d
FROM sales.daily_revenue
ORDER BY transaction_date;
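The two frame clauses map directly onto list slicing. This plain-Python sketch mirrors both calculations on a small series (illustrative only):

```python
def running_stats(daily_amounts, window=7):
    """Mirror the two frames: UNBOUNDED PRECEDING..CURRENT ROW for the
    running total, and (window-1) PRECEDING..CURRENT ROW for the
    trailing moving average."""
    out, total = [], 0
    for i, amount in enumerate(daily_amounts):
        total += amount                                   # running total
        frame = daily_amounts[max(0, i - (window - 1)): i + 1]
        out.append((total, sum(frame) / len(frame)))      # moving average
    return out

print(running_stats([10, 20, 30], window=2))
# [(10, 10.0), (30, 15.0), (60, 25.0)]
```

Note that, like the SQL frame, the moving-average window shrinks at the start of the series rather than producing NULLs.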

Real-World Example: Top 3 Products per Region

WITH product_sales AS (
  SELECT
    region,
    product_name,
    SUM(amount) AS total_sales
  FROM sales.transactions
  GROUP BY region, product_name
),
ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY region ORDER BY total_sales DESC
    ) AS rank
  FROM product_sales
)
SELECT region, product_name, total_sales, rank
FROM ranked
WHERE rank <= 3
ORDER BY region, rank;
Pro tip: Define a named window when you reuse the same OVER clause multiple times. It keeps your query DRY:
SELECT
  customer_id,
  amount,
  ROW_NUMBER() OVER w AS row_num,
  SUM(amount)  OVER w AS running_total
FROM sales.transactions
WINDOW w AS (PARTITION BY customer_id ORDER BY transaction_date);

Working with Semi-Structured Data

Real-world data is messy. APIs return nested JSON, event streams have arrays inside arrays, and configuration data has arbitrary key-value pairs. Spark SQL has first-class tools for all of this.

Querying Nested Fields

-- If you have a struct column, access nested fields with dot notation
SELECT
  event_id,
  user.name       AS user_name,
  user.email      AS user_email,
  user.address.city AS city
FROM events.raw_stream;

explode() — Flatten Arrays into Rows

-- Given a column "items" that is an ARRAY of STRUCTs
SELECT
  order_id,
  item.product_name,
  item.quantity,
  item.price
FROM sales.orders
LATERAL VIEW explode(items) AS item;

-- Or call the generator function directly in the SELECT list
SELECT
  order_id,
  explode(items) AS item
FROM sales.orders;
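Conceptually, explode() is a flat map over the array column: one output row per element, with the parent columns repeated. This plain-Python sketch (with hypothetical order data) mirrors the first query above:

```python
def explode_items(orders):
    """One output row per element of each order's items array,
    repeating the parent order_id -- the same shape explode() produces."""
    return [
        (order["order_id"], item["product_name"], item["quantity"])
        for order in orders
        for item in order["items"]
    ]

orders = [{"order_id": 1,
           "items": [{"product_name": "Widget", "quantity": 2},
                     {"product_name": "Gadget", "quantity": 1}]}]
print(explode_items(orders))
# [(1, 'Widget', 2), (1, 'Gadget', 1)]
```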

from_json() — Parse JSON Strings

-- If your JSON is stored as a plain STRING column
-- First, infer the schema from a sample
SELECT schema_of_json('{"name":"Alice","scores":[90,85,92]}');
-- Returns: STRUCT<name: STRING, scores: ARRAY<BIGINT>>

-- Then parse it
SELECT
  event_id,
  parsed.name,
  parsed.scores
FROM (
  SELECT
    event_id,
    from_json(
      payload,
      'STRUCT<name: STRING, scores: ARRAY<BIGINT>>'
    ) AS parsed
  FROM events.raw_json
);
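If it helps to see the single-row analog: from_json does for a string column what json.loads does for one Python string, producing a nested structure you navigate with the same field names:

```python
import json

# A JSON string like the payload column above:
payload = '{"name": "Alice", "scores": [90, 85, 92]}'

# json.loads is the one-row analog of from_json: string in, struct out.
parsed = json.loads(payload)
print(parsed["name"], parsed["scores"][0])
# Alice 90
```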

Useful Functions for Semi-Structured Data

| Function | What It Does | Example |
|---|---|---|
| explode(array) | One row per array element | explode(items) |
| posexplode(array) | Same, but with index position | posexplode(tags) |
| explode(map) | One row per key-value pair | explode(properties) |
| from_json(str, schema) | Parse JSON string into struct | from_json(payload, schema) |
| to_json(struct) | Convert struct back to JSON string | to_json(user) |
| schema_of_json(str) | Infer schema from a JSON sample | schema_of_json('{"a":1}') |
| get_json_object(str, path) | Extract value using JSONPath | get_json_object(data, '$.name') |
| collect_list(col) | Aggregate rows back into an array | collect_list(product_name) |

Delta Lake advantage: Unlike raw JSON files, Delta tables enforce a schema. When you write nested data to Delta, the schema is stored in the transaction log. If someone tries to insert data with a different structure, Delta will reject it (unless you explicitly enable schema evolution, e.g., via the mergeSchema write option or MERGE WITH SCHEMA EVOLUTION INTO ...). This catches data quality issues early instead of at query time.

Performance Tuning

Writing correct SQL is step one. Writing fast SQL on a distributed system takes a few extra tricks. Here's what actually moves the needle on Databricks.

Photon Engine

Photon is Databricks' native C++ execution engine that replaces parts of the JVM-based Spark executor. It's particularly fast for:

  • Scans and filters on Delta/Parquet data
  • Aggregations (GROUP BY, SUM, COUNT)
  • Joins (especially hash joins)
  • String and date operations

You don't need to change your SQL to use Photon — just select a Photon-enabled cluster or SQL warehouse. The optimizer automatically routes eligible operations to the Photon engine.

Adaptive Query Execution (AQE)

AQE is Spark's ability to change the query plan at runtime based on actual data statistics. It's enabled by default on Databricks and handles three big problems automatically:

  • Coalescing shuffle partitions — merges too-small partitions after a shuffle to avoid the overhead of tiny tasks
  • Converting sort-merge joins to broadcast joins — if one side of a join turns out to be small after filtering, AQE switches to the faster broadcast strategy
  • Handling skewed joins — splits oversized partitions so no single task takes forever
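The first of these, partition coalescing, is easy to visualize. Here is a simplified plain-Python sketch of the idea; Spark's actual algorithm also respects minimum partition counts and skew handling, so treat this purely as intuition:

```python
def coalesce_partitions(sizes_mb, target_mb=128):
    """Greedily merge adjacent post-shuffle partitions until each group
    reaches roughly target_mb. A simplification of AQE's coalescing rule."""
    groups, current = [], 0
    for size in sizes_mb:
        current += size
        if current >= target_mb:
            groups.append(current)
            current = 0
    if current:
        groups.append(current)  # leftover tail group
    return groups

# Eight tiny partitions collapse into three groups, so the scheduler
# launches 3 tasks instead of 8:
print(coalesce_partitions([30, 30, 40, 40, 20, 50, 60, 10]))
# [140, 130, 10]
```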

Partition Pruning

If your table is partitioned by date, always include a WHERE filter on that column. Spark skips entire directories of data it doesn't need to read.

-- Partition your table by date
CREATE TABLE sales.events (
  event_id    BIGINT,
  event_type  STRING,
  payload     STRING,
  event_date  DATE
)
USING DELTA
PARTITIONED BY (event_date);

-- This query only reads 1 day's partition instead of all data
SELECT * FROM sales.events
WHERE event_date = '2026-03-15';
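Conceptually, partition pruning is just directory-level filtering: the partition value is the directory name, so the filter eliminates whole directories before any file is opened. A toy Python sketch (the partitions dict is a hypothetical stand-in for the directory layout):

```python
def scan_with_pruning(partitions, wanted_date):
    """partitions maps a partition value (event_date) to that
    directory's rows. Only matching directories are ever read."""
    matching = {d: rows for d, rows in partitions.items() if d == wanted_date}
    rows = [row for part in matching.values() for row in part]
    return rows, len(matching)  # (result rows, partitions actually read)

partitions = {
    "2026-03-14": [("e1", "click"), ("e2", "view")],
    "2026-03-15": [("e3", "click")],
}
print(scan_with_pruning(partitions, "2026-03-15"))
# ([('e3', 'click')], 1)
```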

Z-ORDER for Multi-Column Lookups

-- Co-locate data by frequently filtered columns
OPTIMIZE sales.transactions
ZORDER BY (customer_id, product_name);

-- Now this query benefits from data skipping
SELECT * FROM sales.transactions
WHERE customer_id = 12345 AND product_name = 'Widget Pro';
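Under the hood, Z-ordering sorts rows by an interleaved-bit (Morton) key, so rows with similar values in both columns land in the same files. A minimal sketch of the key construction, for intuition:

```python
def z_order_key(x, y, bits=8):
    """Interleave the bits of two column values into one Morton key.
    Sorting by this key clusters rows that are close in BOTH columns
    into the same files, enabling data skipping on either filter column."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x occupies even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit positions
    return key

# x=3 (binary 011) and y=5 (binary 101) interleave to 100111 = 39:
print(z_order_key(3, 5))
# 39
```

(Real Z-ordering works on range-partitioned column values rather than raw integers, but the interleaving principle is the same.)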

Caching

-- Cache a table in memory for repeated access
CACHE TABLE sales.dim_products;

-- Uncache when you're done
UNCACHE TABLE sales.dim_products;

-- Note: there is no SQL command that lists cached tables;
-- check the Storage tab in the Spark UI instead.

Use EXPLAIN to Diagnose

-- See what Spark will actually do
EXPLAIN EXTENDED
SELECT customer_id, SUM(amount)
FROM sales.transactions
WHERE transaction_date >= '2026-01-01'
GROUP BY customer_id;

-- Look for these in the output:
-- PartitionFilters: good (partition pruning working)
-- BroadcastHashJoin: good for small tables
-- SortMergeJoin: expected for large-large joins
-- Exchange: shuffle happening (expensive but sometimes necessary)
The small file problem: If your ingestion pipeline creates thousands of tiny files (kilobytes to a few megabytes each, far below the roughly 128 MB to 1 GB target), query performance degrades badly. Each file means a separate task, and the overhead of scheduling those tasks dominates actual work. Fix this by running OPTIMIZE sales.my_table regularly to compact files, or enable auto-compaction: ALTER TABLE sales.my_table SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true', 'delta.autoOptimize.autoCompact' = 'true').
Always start with EXPLAIN: Before tuning anything, run EXPLAIN EXTENDED on your slow query. Nine times out of ten, the plan reveals the bottleneck: a missing partition filter, a sort-merge join that could be a broadcast join, or a skewed shuffle. Don't guess — read the plan.

Databricks SQL vs Notebook SQL

You can run SQL in two places on Databricks: the SQL Editor (powered by a SQL warehouse) and a notebook cell (powered by an all-purpose or jobs cluster). They use the same Spark SQL dialect, but the use cases and economics are different.

| Aspect | Databricks SQL (SQL Warehouse) | Notebook SQL (All-Purpose Cluster) |
|---|---|---|
| Target user | Analysts, BI consumers | Engineers, data scientists |
| Compute | Serverless or classic SQL warehouse | All-purpose or jobs cluster |
| Startup time | Serverless: seconds; Classic: minutes | Minutes (cluster has to start) |
| Photon | Always on | Only on Photon-enabled clusters |
| Languages | SQL only | SQL, Python, Scala, R in same notebook |
| Dashboards | Built-in dashboard builder | No built-in dashboards |
| BI tool integration | Native JDBC/ODBC endpoint for Tableau, Power BI | Possible but not typical |
| Cost model | Per-query or per-hour (warehouse sizing) | Per-hour (cluster runs until terminated) |
| Alerts | Built-in SQL alerts (email/Slack when a condition is met) | No built-in alerting |

Rule of thumb: If you're writing ad-hoc queries, building dashboards, or connecting BI tools — use a SQL warehouse. If you're building an ETL pipeline that mixes SQL with Python transformations, or doing data science — use a notebook on an all-purpose cluster.

Test Yourself

Q: What is the difference between a TEMP VIEW and a GLOBAL TEMP VIEW?

A TEMP VIEW is scoped to a single SparkSession (one notebook or SQL tab) and disappears when that session ends. A GLOBAL TEMP VIEW is visible across all notebooks attached to the same cluster, must be referenced via the global_temp schema (e.g., SELECT * FROM global_temp.my_view), and survives until the cluster restarts.

Q: Write a window function that assigns a rank to each customer's orders by amount (highest first), resetting the rank for each customer.

RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC). Use ROW_NUMBER() instead if you need unique numbers even when amounts are tied. Use DENSE_RANK() if you want no gaps in the ranking sequence after ties.

Q: What happens when you DROP a managed table vs an external table?

Dropping a managed table deletes both the metadata (catalog entry) and the underlying data files. Dropping an external table deletes only the metadata — the data files remain in your cloud storage. You can verify which type you have with DESCRIBE EXTENDED table_name.

Q: How does the Photon engine improve query performance, and do you need to change your SQL to use it?

Photon is a native C++ execution engine that replaces JVM-based Spark operations for eligible queries. It accelerates scans, filters, aggregations, joins, and string operations. You do not need to change your SQL — just run your queries on a Photon-enabled cluster or SQL warehouse, and the optimizer routes eligible operations to Photon automatically.

Q: Why are CTEs preferred over deeply nested subqueries in Spark SQL?

CTEs (WITH clauses) make complex queries more readable and maintainable by naming intermediate results. Instead of reading inside-out, you read top-to-bottom. Performance is generally the same since Spark's Catalyst optimizer inlines CTEs. However, readability matters for debugging, code review, and collaboration — especially in production pipelines.

Interview Questions

Q: Explain how Spark SQL processes a query from the moment you hit "Run" to the final result. What is the role of the Catalyst optimizer?

When you submit a SQL query, Spark goes through four phases: (1) Parsing — converts SQL text into an unresolved logical plan (an abstract syntax tree). (2) Analysis — resolves table names, column names, and data types using the catalog. (3) Optimization — the Catalyst optimizer applies rule-based and cost-based optimizations (predicate pushdown, constant folding, join reordering, column pruning). (4) Physical planning — Catalyst generates multiple physical plans, picks the best one using cost models, and generates JVM bytecode (or routes to Photon). The result is a DAG of stages and tasks distributed across the cluster.

Q: You have a Spark SQL query that joins a 500 GB fact table with a 50 MB dimension table. How would you optimize this join?

The 50 MB dimension table is an ideal candidate for a broadcast hash join. Note, however, that 50 MB is above the default broadcast threshold of 10 MB, so Spark won't broadcast it automatically: either use the hint SELECT /*+ BROADCAST(dim) */ ... FROM fact JOIN dim ON ..., or raise spark.sql.autoBroadcastJoinThreshold to at least 50 MB. The small table then gets copied to every executor, eliminating the expensive shuffle of the large table. With Adaptive Query Execution enabled on Databricks, Spark may also switch to a broadcast join at runtime if it detects the dimension table is small after filtering.
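A toy illustration of why broadcasting avoids the shuffle: the small side becomes an in-memory hash map that every fact row can probe locally (plain Python, hypothetical data):

```python
def broadcast_hash_join(fact_rows, dim_rows, key):
    """Build a hash map from the small (broadcast) side once, then let
    every fact row probe it locally. The big table never moves, which
    is exactly the shuffle a sort-merge join would have required."""
    dim_index = {row[key]: row for row in dim_rows}  # the broadcast side
    return [{**fact, **dim_index[fact[key]]}
            for fact in fact_rows if fact[key] in dim_index]

fact = [{"customer_id": 1, "amount": 50}, {"customer_id": 2, "amount": 75}]
dim = [{"customer_id": 1, "name": "Alice"}]
print(broadcast_hash_join(fact, dim, "customer_id"))
# [{'customer_id': 1, 'amount': 50, 'name': 'Alice'}]
```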

Q: What is the difference between partition pruning and data skipping in Delta Lake, and when does each apply?

Partition pruning skips entire directories based on the partition column in your WHERE clause (e.g., filtering on event_date when the table is PARTITIONED BY (event_date)). Data skipping uses min/max statistics stored in the Delta transaction log to skip individual Parquet files even within a partition. Z-ORDER enhances data skipping by co-locating related values in the same files. Partition pruning applies before any file is read; data skipping applies when Spark checks file-level metadata before opening each file.
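Data skipping itself can be sketched in a few lines: each file's min/max statistics decide whether the file is opened at all (a deliberate simplification of the per-file stats in the Delta transaction log):

```python
def scan_with_skipping(files, lo, hi):
    """Each file carries (min, max, rows) for one column. A file is
    opened only when its [min, max] range could contain matches."""
    opened, matches = 0, []
    for col_min, col_max, rows in files:
        if col_max < lo or col_min > hi:
            continue                    # stats prove no match: file skipped
        opened += 1
        matches += [r for r in rows if lo <= r <= hi]
    return matches, opened              # (matching rows, files actually read)

files = [(1, 100, [5, 60, 99]),
         (101, 200, [150, 180]),
         (201, 300, [250])]
print(scan_with_skipping(files, 120, 160))
# ([150], 1)
```

Only one of the three files is read: the stats of the other two prove they cannot contain rows in the range.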

Q: A colleague writes a query with a correlated subquery that runs for 20 minutes. How would you diagnose and fix it?

First, run EXPLAIN EXTENDED to see the physical plan. Correlated subqueries sometimes get rewritten as joins by the Catalyst optimizer, but when they don't, each outer row triggers a separate subquery execution. The fix: rewrite the correlated subquery as either (1) a JOIN with a pre-aggregated subquery, or (2) a window function. For example, WHERE amount = (SELECT MAX(amount) ... WHERE t2.id = t1.id) can be rewritten as ROW_NUMBER() OVER (PARTITION BY id ORDER BY amount DESC) = 1. Also check if the table is partitioned and whether the query triggers a full table scan.

Q: When would you choose a SQL warehouse over an all-purpose cluster for running SQL workloads on Databricks?

Choose a SQL warehouse when: (1) the workload is pure SQL (no Python/Scala needed), (2) you need BI tool connectivity via JDBC/ODBC (Tableau, Power BI), (3) you want serverless startup in seconds, (4) you're building Databricks SQL dashboards or alerts, (5) cost optimization matters (warehouses auto-suspend and charge per-query in serverless mode). Choose an all-purpose cluster when: (1) you need mixed-language notebooks (SQL + Python), (2) you're doing ETL development with iterative testing, (3) you need libraries or custom packages installed, (4) you're running ML workloads alongside SQL.