Databricks Core Concepts

TL;DR

Databricks has 8 building blocks: Lakehouse architecture, Delta Lake storage, Unity Catalog governance, Clusters for compute, Notebooks for coding, Workflows for orchestration, Databricks SQL for BI, and MLflow for ML lifecycle. Master these and you understand the whole platform.

*Figure: Databricks platform — 8 building blocks arranged around the lakehouse core*
Explain Like I'm 12

Think of Databricks as a giant LEGO set for data. The Lakehouse is the base plate. Delta Lake is how you store your bricks neatly (not thrown in a pile). Unity Catalog is the instruction manual that says who can use which bricks. Clusters are the workers who snap bricks together. Notebooks are where you draw your designs. Workflows tell workers what order to build in. Databricks SQL lets business people ask questions about the finished model. And MLflow is the quality control team that tests if the model is good.

Cheat Sheet

| Concept | What It Is | Plain English |
| --- | --- | --- |
| Lakehouse | Architecture combining lake + warehouse | One system for all data — no more copying between lake and warehouse |
| Delta Lake | Open storage layer with ACID transactions | Your data lake becomes reliable — no more corrupted or partial files |
| Unity Catalog | Unified governance for data and AI | One place to control who can see what, plus lineage and discovery |
| Cluster | Group of VMs that run your Spark jobs | The compute power — scales up when you need it, shuts down when you don't |
| Notebook | Interactive document with code + results | Like a Google Doc but for code — write SQL, Python, or Scala and see results inline |
| Workflow | Orchestrated sequence of tasks | Automate: "Run this notebook, then that one, every day at 6 AM" |
| Databricks SQL | Serverless SQL analytics engine | A fast SQL warehouse for BI tools — connect Power BI, Tableau, etc. |
| MLflow | ML experiment tracking and model registry | Track which model version works best, then deploy the winner |

1. Lakehouse Architecture

The lakehouse is the foundational idea behind Databricks. Instead of maintaining a data lake (raw files in cloud storage) AND a data warehouse (structured, expensive), you use one system that gives you both.

Key insight: Your data stays in open formats (Parquet, Delta) on cheap cloud storage (S3, ADLS, GCS). Databricks adds a metadata and transaction layer on top — giving you warehouse-grade reliability without warehouse-grade costs.
Data in the lakehouse is typically organized using the medallion architecture — three layers of increasing quality:

| Layer | What Lives Here |
| --- | --- |
| Bronze | Raw data exactly as ingested — no transformations |
| Silver | Cleaned, deduplicated, joined — conforming to schemas |
| Gold | Aggregated, business-ready — ready for dashboards and ML |
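The medallion flow can be sketched in SQL. This is a minimal, illustrative example — the table and column names (`bronze.orders_raw`, `silver.orders`, `gold.daily_revenue`) are hypothetical, not part of any standard schema:

```sql
-- Bronze → Silver: clean, cast, and deduplicate raw events
CREATE OR REPLACE TABLE silver.orders AS
SELECT DISTINCT
  CAST(order_id AS BIGINT)          AS order_id,
  TRIM(product)                     AS product,
  CAST(amount AS DECIMAL(10,2))     AS amount
FROM bronze.orders_raw
WHERE order_id IS NOT NULL;

-- Silver → Gold: aggregate into a business-ready table
CREATE OR REPLACE TABLE gold.daily_revenue AS
SELECT product, SUM(amount) AS revenue
FROM silver.orders
GROUP BY product;
```

Each step reads from the previous layer and writes a higher-quality table, so problems can always be traced back to the raw Bronze data.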

2. Delta Lake

Delta Lake is the storage format that makes the lakehouse possible. It's an open-source project that adds ACID transactions, schema enforcement, and time travel to Parquet files.

```sql
-- Create a Delta table
CREATE TABLE sales (
  id BIGINT,
  product STRING,
  amount DECIMAL(10,2),
  sold_at TIMESTAMP
) USING DELTA;

-- Time travel: query an earlier version of the table
SELECT * FROM sales VERSION AS OF 42;

-- Restore the table to that version
RESTORE TABLE sales TO VERSION AS OF 42;
```

Deep dive: Delta Lake →

3. Unity Catalog

Unity Catalog is the governance layer. It provides a 3-level namespace: catalog.schema.table. You define access policies once, and they apply across all workspaces.

Think of it like: A filing cabinet (catalog) with drawers (schemas) holding folders (tables). Unity Catalog controls who can open which drawer.
```sql
-- Grant read access to a schema
GRANT SELECT ON SCHEMA analytics.finance TO `[email protected]`;

-- View data lineage
-- (Available in the Catalog Explorer UI)
```

4. Clusters

A cluster is a set of cloud VMs that run your Spark code. Databricks manages Spark for you — you just pick a size and go.

| Cluster Type | When to Use |
| --- | --- |
| All-Purpose | Interactive notebooks, development, ad-hoc analysis |
| Job Cluster | Automated workflows — spins up, runs the job, shuts down |
| SQL Warehouse | BI queries from Tableau, Power BI, or Databricks SQL |

**Cost tip:** Always use auto-termination (e.g., terminate after 30 min idle) on all-purpose clusters. Forgotten running clusters are the #1 cost surprise on Databricks.
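As a rough sketch, a cluster definition with auto-termination looks like the dictionary below (field names follow the Databricks Clusters REST API; the runtime version and node type are illustrative examples and will differ per cloud and workspace):

```python
# Minimal cluster spec with auto-termination enabled.
# All values here are illustrative placeholders.
cluster_spec = {
    "spark_version": "15.4.x-scala2.12",  # a Databricks Runtime version
    "node_type_id": "i3.xlarge",          # VM type (cloud-specific)
    "num_workers": 2,                     # worker VMs (plus one driver)
    "autotermination_minutes": 30,        # shut down after 30 min idle
}
```

Setting `autotermination_minutes` is what prevents the "forgotten cluster" cost surprise mentioned above.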

5. Notebooks

Notebooks are your interactive workspace. Each cell can be Python, SQL, Scala, or R. Results render inline — tables, charts, and text.

```python
# Python cell
df = spark.read.table("analytics.sales")
display(df.groupBy("product").sum("amount"))
```

```sql
-- SQL cell (use the %sql magic command in Python notebooks)
SELECT product, SUM(amount) AS total
FROM analytics.sales
GROUP BY product
ORDER BY total DESC
```

6. Workflows

Workflows let you orchestrate tasks: run notebooks, Python scripts, dbt models, or Spark JARs on a schedule or trigger. Each task gets its own job cluster (cost-efficient).

Workflows vs. Airflow: Databricks Workflows are built-in and tightly integrated (auto-retries, task dependencies, parameterized runs). Airflow is better if you need to orchestrate across systems beyond Databricks.
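A two-task job with a dependency can be sketched as the structure below, in the style of the Databricks Jobs API (notebook paths and the cron expression are hypothetical placeholders):

```python
# A job with two tasks: "transform" runs only after "ingest" succeeds.
# Paths, names, and the schedule are illustrative placeholders.
job_definition = {
    "name": "daily_sales_pipeline",
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",  # every day at 6 AM
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/pipelines/ingest_sales"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # task dependency
            "notebook_task": {"notebook_path": "/pipelines/transform_sales"},
        },
    ],
}
```

The `depends_on` field is how Workflows expresses ordering: tasks form a dependency graph, and each task can run on its own ephemeral job cluster.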

7. Databricks SQL

Databricks SQL is a serverless SQL engine optimized for BI workloads. It lets analysts query Delta tables with standard SQL, create dashboards, and connect BI tools.

| Feature | Details |
| --- | --- |
| SQL Editor | Write and run SQL queries in the browser |
| Dashboards | Build visualizations from query results |
| Alerts | Get notified when a query result crosses a threshold |
| BI Connectors | Partner Connect for Tableau, Power BI, Looker, etc. |
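For example, an alert could monitor a query like the sketch below and notify you when the result crosses a threshold. The query reuses the `sales` table from the Delta Lake example; the date logic is illustrative:

```sql
-- Query behind a hypothetical alert: total revenue for yesterday.
-- An alert on this query could fire when revenue drops below a threshold.
SELECT SUM(amount) AS revenue_yesterday
FROM analytics.sales
WHERE DATE(sold_at) = date_sub(current_date(), 1);
```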

8. MLflow

MLflow is an open-source platform for the ML lifecycle. On Databricks, it's pre-installed and integrated with Unity Catalog for model governance.

```python
import mlflow

# `model` is assumed to be a trained scikit-learn estimator
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("rmse", 0.85)
    mlflow.sklearn.log_model(model, "model")
```

**MLflow tracks:** parameters (what you tried), metrics (how well it worked), artifacts (the model file), and source code (which notebook ran it).

Test Yourself

Q: What are the three layers of the Medallion architecture?

Bronze (raw ingested data), Silver (cleaned and conformed), Gold (aggregated, business-ready). Data flows Bronze → Silver → Gold with increasing quality at each stage.

Q: What's the difference between an all-purpose cluster and a job cluster?

All-purpose clusters stay running for interactive use (notebooks, exploration). Job clusters spin up for a specific workflow task and terminate when done — more cost-efficient for production pipelines.

Q: What namespace hierarchy does Unity Catalog use?

catalog.schema.table (three levels). For example: production.finance.invoices. You grant access at any level — catalog, schema, or individual table.

Q: How does Delta Lake differ from plain Parquet?

Delta Lake adds a transaction log on top of Parquet files. This gives you ACID transactions, time travel (query past versions), schema enforcement, and the ability to UPDATE/DELETE/MERGE — none of which plain Parquet supports.
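A sketch of an upsert that plain Parquet cannot express — `sales` reuses the table from the Delta Lake example, while the staging table `sales_updates` is a hypothetical source of new and changed rows:

```sql
-- Upsert: update matching rows, insert new ones, in one ACID transaction
MERGE INTO sales AS t
USING sales_updates AS s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```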

Q: When would you use Databricks SQL instead of a notebook?

Databricks SQL is for BI workloads: analysts running SQL queries, building dashboards, connecting BI tools (Power BI, Tableau). Notebooks are for development, data engineering, and data science where you need Python/Scala and interactive experimentation.