Lakehouses & Warehouses in Microsoft Fabric

TL;DR

Fabric gives you two ways to store structured data: the Lakehouse (files + Delta tables, queried with Spark or through an auto-generated, read-only SQL endpoint) and the Warehouse (full T-SQL, including INSERT/UPDATE/DELETE). Both store data as Delta Parquet in OneLake. Use a lakehouse for data engineering (Spark, Python) and a warehouse for SQL-heavy analytics. Use Direct Lake to connect Power BI to either without importing data.

Explain Like I'm 12

A lakehouse is like a big toy chest where you can throw in anything (LEGOs, action figures, books) and a magic robot (Spark) sorts it for you. You can also look at the organized stuff through a window (SQL endpoint) but you can’t rearrange through that window. A warehouse is like a perfectly organized bookshelf where you can add books, move them, or remove them using labels (SQL). Both sit in the same room (OneLake).

(Diagram) Lakehouse architecture: raw data flows into the Files, Delta Tables, and SQL endpoint layers, then via Direct Lake to Power BI, with the Warehouse as an alternative T-SQL path.

Lakehouse Architecture

A Fabric lakehouse has three layers that are automatically kept in sync:

| Layer | What It Contains | How You Access It |
| --- | --- | --- |
| Files | Raw files (CSV, JSON, Parquet, images, etc.) | Spark notebooks, ABFS paths |
| Tables | Delta tables (Parquet + transaction log) | Spark SQL, PySpark, SQL endpoint |
| SQL Endpoint | Auto-generated, read-only T-SQL interface to Delta tables | SSMS, Azure Data Studio, any SQL client |

Auto-discovery: When you write a Delta table in a notebook, it automatically appears in the SQL endpoint and the default semantic model. No manual registration needed.

# Create a Delta table from raw CSV files
df = spark.read.csv("Files/raw/customers.csv", header=True, inferSchema=True)

# Clean the data
df_clean = (df
    .dropDuplicates(["customer_id"])
    .withColumn("created_date", df["created_date"].cast("date"))
    .filter(df["customer_id"].isNotNull())
)

# Write as managed Delta table — auto-appears in SQL endpoint
df_clean.write.format("delta").mode("overwrite").saveAsTable("dim_customers")

Warehouse Deep Dive

The Fabric warehouse is a fully managed T-SQL data warehouse. Unlike the lakehouse SQL endpoint, it supports full DML operations:

-- Create a table in the warehouse
CREATE TABLE dbo.fact_sales (
    sale_id     BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    product_id  INT NOT NULL,
    sale_date   DATE NOT NULL,
    amount      DECIMAL(18,2) NOT NULL,
    region      VARCHAR(50)
);

-- Insert data (not possible in lakehouse SQL endpoint!)
INSERT INTO dbo.fact_sales
SELECT * FROM bronze_lakehouse.dbo.raw_sales
WHERE sale_date >= '2024-01-01';

-- Cross-database query (warehouse exclusive)
SELECT w.sale_id, l.customer_name
FROM dbo.fact_sales w
JOIN lakehouse_db.dbo.dim_customers l
    ON w.customer_id = l.customer_id;
When to pick warehouse over lakehouse: Choose warehouse when your team is primarily SQL-based, you need INSERT/UPDATE/DELETE, you want stored procedures, or you need cross-database queries. Choose lakehouse when you need Spark/Python for complex transformations or you work with semi-structured data.

Lakehouse vs Warehouse: Decision Matrix

| Criteria | Lakehouse | Warehouse |
| --- | --- | --- |
| Primary language | PySpark, Spark SQL | T-SQL |
| Write operations | Spark notebooks only | Full T-SQL DML |
| SQL read access | SQL endpoint (read-only) | Full T-SQL |
| Stored procedures | No | Yes |
| Cross-database queries | No | Yes |
| Semi-structured data | Excellent (JSON, XML, etc.) | Limited |
| Data science / ML | Native (Spark ML, MLflow) | Not supported |
| Streaming ingestion | Spark Structured Streaming | No |
| Storage format | Delta Parquet in OneLake | Delta Parquet in OneLake |
| Direct Lake in Power BI | Yes | Yes |
| Best for | Data engineers, data scientists | SQL analysts, BI developers |

You can use both: A common pattern is lakehouse for Bronze/Silver (data engineering with Spark) and warehouse for Gold (SQL-based curated layer for analysts). They both sit in OneLake and can reference each other via shortcuts.
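The decision matrix can be condensed into a rough sketch. This is purely illustrative (the function name and parameters are invented for this example, not part of any Fabric API), but it captures the tie-breakers that matter most:

```python
def recommend_store(needs_sql_dml=False, needs_stored_procs=False,
                    uses_spark=False, semi_structured=False):
    """Rough decision sketch of the matrix above (illustrative, not an official rule)."""
    if needs_sql_dml or needs_stored_procs:
        return "warehouse"   # only the warehouse offers full T-SQL DML and stored procedures
    if uses_spark or semi_structured:
        return "lakehouse"   # Spark/Python and semi-structured data favor the lakehouse
    return "either"          # both store Delta Parquet in OneLake

print(recommend_store(needs_sql_dml=True))   # warehouse
print(recommend_store(uses_spark=True))      # lakehouse
```

In practice the medallion pattern sidesteps the either/or question: lakehouse for Bronze/Silver, warehouse for Gold.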

Delta Tables in Fabric

Every table in Fabric (lakehouse or warehouse) is stored as a Delta table — Parquet data files plus a JSON transaction log (_delta_log/). This gives you:

  • ACID transactions — no corrupt data from failed writes
  • Time travel — query data as it was at any point in time
  • Schema enforcement — reject writes that don’t match the table schema
  • Z-Order optimization — co-locate related rows for faster queries
# Time travel — read the table as it was 2 versions ago
df_old = spark.read.format("delta").option("versionAsOf", 2).table("fact_sales")

# Or by timestamp
df_yesterday = (spark.read.format("delta")
    .option("timestampAsOf", "2026-03-30")
    .table("fact_sales"))

# Optimize with Z-Order for common query patterns
spark.sql("OPTIMIZE fact_sales ZORDER BY (region, sale_date)")
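Under the hood, time travel works by replaying the JSON commit files in `_delta_log/`: each commit adds and removes data files, and a version is just the set of files active after replaying commits up to that point. Here is a deliberately simplified pure-Python sketch (real commit entries carry far more metadata than the `add`/`remove` actions modeled here):

```python
import json

# Hypothetical, simplified _delta_log commits: one JSON action per line.
commits = [
    '{"add": {"path": "part-000.parquet"}}',       # version 0: initial write
    '{"add": {"path": "part-001.parquet"}}\n'
    '{"remove": {"path": "part-000.parquet"}}',    # version 1: overwrite
]

def snapshot(commits, version):
    """Replay commits 0..version: 'add' registers a data file, 'remove' retires it."""
    files = set()
    for commit in commits[: version + 1]:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

print(snapshot(commits, 0))  # {'part-000.parquet'}  -> what versionAsOf=0 reads
print(snapshot(commits, 1))  # {'part-001.parquet'}  -> the current version
```

This is why time travel is cheap: old Parquet files are still on disk, and an older version is reconstructed by replaying fewer log entries, not by restoring a backup.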
V-Order: Fabric applies V-Order optimization automatically to Delta tables. V-Order is a write-time optimization that sorts and compresses Parquet row groups for faster reads by both Spark and the SQL engine. You don’t need to do anything — it’s on by default.

Direct Lake Mode

Direct Lake is Fabric’s breakthrough connection mode for Power BI. Instead of importing data or sending live queries, Power BI reads Delta Parquet files directly from OneLake into the VertiPaq engine’s memory on demand.

How Direct Lake Works

  1. A user opens a Power BI report
  2. The VertiPaq engine checks which Delta table columns are needed
  3. It reads the Parquet files from OneLake directly (no copy, no query to Spark/SQL)
  4. Data is loaded into columnar memory for lightning-fast DAX calculations
  5. When Delta tables are updated, the next report interaction picks up the new data
Fallback behavior: If Direct Lake can’t load the data (too large, unsupported data types, or a DAX query that needs features not yet supported), it falls back to DirectQuery mode automatically. Monitor this in the Performance Analyzer — frequent fallbacks mean you should optimize your tables.

Direct Lake Requirements

  • Data must be in Delta tables in a Fabric lakehouse or warehouse
  • Tables must be in the default semantic model (auto-created) or a custom one
  • Capacity must be F64 or higher for production-scale Direct Lake
  • Row count and column limits depend on SKU (F64: 300M rows per table)
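The fallback rules above amount to a simple guard check before each query. A minimal sketch (the function and its parameters are illustrative, not the actual engine logic):

```python
def direct_lake_mode(row_count, sku_row_limit, has_unsupported_types, dax_supported):
    """Simplified sketch of Direct Lake's fallback decision (illustrative only)."""
    if row_count > sku_row_limit or has_unsupported_types or not dax_supported:
        return "DirectQuery"   # automatic fallback
    return "DirectLake"

# Using the F64 per-table row limit stated above
print(direct_lake_mode(250_000_000, 300_000_000, False, True))  # DirectLake
print(direct_lake_mode(400_000_000, 300_000_000, False, True))  # DirectQuery
```

The key takeaway: fallback is per query, silent, and automatic, which is why frequent fallbacks only show up when you go looking in Performance Analyzer.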

OneLake Shortcuts

Shortcuts let you reference external data without copying it. They appear as folders in your lakehouse, but the data stays in its original location.

| Shortcut Source | Use Case |
| --- | --- |
| Another OneLake lakehouse | Share data between workspaces without duplication |
| Azure Data Lake Gen2 | Access existing ADLS data without migration |
| Amazon S3 | Multi-cloud: query S3 data from Fabric |
| Google Cloud Storage | Multi-cloud: query GCS data from Fabric |
| Dataverse | Access Dynamics 365 / Power Platform data |

Performance note: Shortcuts to external sources (S3, GCS) have higher latency than native OneLake data. For frequently queried data, consider copying it into OneLake with a pipeline and using shortcuts only for rarely accessed or very large datasets.
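Conceptually, a shortcut is just a path redirection: a folder name inside the lakehouse resolves to an external location at read time, and the bytes never move. A pure-Python sketch of that idea (the shortcut names and target URLs here are made up for illustration):

```python
# Hypothetical shortcut map: folder name in the lakehouse -> external location.
shortcuts = {
    "s3_sales": "s3://acme-data/sales",
    "adls_archive": "abfss://archive@acme.dfs.core.windows.net/history",
}

def resolve(path, shortcuts):
    """Map a lakehouse-relative path through a shortcut; the data stays put."""
    for name, target in shortcuts.items():
        prefix = f"Files/{name}/"
        if path.startswith(prefix):
            return target + "/" + path[len(prefix):]
    return path  # plain OneLake path, no shortcut involved

print(resolve("Files/s3_sales/2024/day01.parquet", shortcuts))
# s3://acme-data/sales/2024/day01.parquet
```

This is also why external shortcuts carry extra latency: every read behind the shortcut becomes a request to the remote store, not to OneLake.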

Medallion Architecture in Practice

Here’s a concrete implementation of the Bronze → Silver → Gold pattern in Fabric:

# === BRONZE: Raw ingestion (notebook in bronze_lakehouse) ===
# Ingest raw CSV files into a Delta table as-is
df_raw = spark.read.csv("Files/landing/sales_*.csv", header=True)
df_raw.write.format("delta").mode("append").saveAsTable("raw_sales")

# === SILVER: Clean + conform (notebook in silver_lakehouse) ===
# Read from bronze via shortcut, clean, write to silver
df_bronze = spark.read.table("bronze_lakehouse.raw_sales")
df_silver = (df_bronze
    .dropDuplicates(["transaction_id"])
    .filter("amount > 0")
    .withColumn("sale_date", df_bronze["sale_date"].cast("date"))
    .withColumnRenamed("cust_id", "customer_id")
)
df_silver.write.format("delta").mode("overwrite").saveAsTable("clean_sales")

# === GOLD: Business aggregates (notebook in gold_lakehouse) ===
df_gold = spark.sql("""
    SELECT region, sale_date, COUNT(*) as txn_count, SUM(amount) as revenue
    FROM silver_lakehouse.clean_sales
    GROUP BY region, sale_date
""")
df_gold.write.format("delta").mode("overwrite").saveAsTable("daily_revenue")
Power BI connects to Gold: Point your Direct Lake semantic model at the gold_lakehouse. Analysts get curated, aggregated data without touching Bronze or Silver layers.

Test Yourself

Q: You have a team of SQL analysts who need INSERT/UPDATE/DELETE capabilities. Should you use a lakehouse or warehouse?

Warehouse. The lakehouse SQL endpoint is read-only. Only the warehouse supports full T-SQL DML (INSERT, UPDATE, DELETE) and stored procedures.

Q: How does Direct Lake differ from Import and DirectQuery modes?

Import copies data into the Power BI model (fast but stale). DirectQuery queries the source live (fresh but slow). Direct Lake reads Parquet files directly from OneLake into memory on demand — combining import-level speed with near-real-time freshness, and no data copy.

Q: What happens when a PySpark notebook writes a Delta table to a lakehouse?

The table automatically appears in the lakehouse’s SQL endpoint (queryable via T-SQL) and the default semantic model (available for Power BI Direct Lake reports). No manual registration is needed.

Q: What is V-Order and do you need to enable it?

V-Order is a write-time optimization that sorts and compresses Parquet row groups for faster reads. It’s enabled by default in Fabric — you don’t need to do anything. It works across all Fabric engines (Spark, SQL, Power BI).

Q: When would you use a OneLake shortcut instead of copying data into a lakehouse?

Use shortcuts when you want to query external data (ADLS Gen2, S3, GCS, another workspace’s lakehouse) without duplicating it. Best for large datasets that are rarely queried or data you don’t own. For frequently queried data, copying into OneLake gives better performance.

Interview Questions

Q: Your company has an existing Azure Data Lake with 50TB of Parquet files. How would you integrate this with Fabric without moving the data?

Create a OneLake shortcut to the ADLS Gen2 account from a Fabric lakehouse. The shortcut makes the external data appear as a folder in the lakehouse. Spark notebooks and the SQL endpoint can query it directly. For frequently accessed data, create a pipeline to incrementally copy hot data into OneLake while keeping the shortcut for cold data.

Q: A Power BI report on Direct Lake mode is frequently falling back to DirectQuery. How do you troubleshoot?

Check: (1) Table row counts vs SKU limits (F64 allows 300M rows per table). (2) Unsupported data types (e.g., complex nested types). (3) DAX queries hitting features not yet supported in Direct Lake. (4) Use Performance Analyzer in Power BI Desktop to identify fallback queries. Fix by optimizing tables (aggregate, partition, or split large tables), fixing data types, or upgrading capacity SKU.

Q: How would you implement a medallion architecture in Fabric for a real-time e-commerce analytics pipeline?

Create three lakehouses: bronze_lakehouse (raw events via Eventstream), silver_lakehouse (cleaned/deduped with Spark notebooks), gold_lakehouse (business aggregates). Use Fabric pipelines to orchestrate Bronze → Silver → Gold transformations on a schedule. Point Power BI Direct Lake at Gold. For real-time metrics, use Real-Time Intelligence (KQL database) in parallel with the batch medallion flow.

Q: Explain the difference between a managed table and an unmanaged (external) table in a Fabric lakehouse.

Managed table: Fabric manages both the metadata and the data files. When you drop the table, both are deleted. Created with saveAsTable(). Unmanaged (external) table: You specify the file path. Fabric manages only the metadata. Dropping the table removes the metadata but not the files. Created with .save("Files/path") plus a separate table registration.
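The drop semantics are the crux of the managed/external distinction. This toy catalog simulates them in plain Python (class and method names are invented for illustration; they only mimic the behavior described above):

```python
# A toy catalog that mimics managed vs. external drop semantics (purely illustrative).
class Catalog:
    def __init__(self):
        self.tables = {}    # table name -> (kind, path)
        self.storage = {}   # path -> list of data files

    def save_as_table(self, name, files):
        """Managed: the engine owns both the metadata and the data files."""
        path = f"Tables/{name}"
        self.tables[name] = ("managed", path)
        self.storage[path] = files

    def register_external(self, name, path):
        """External: metadata points at files written elsewhere."""
        self.tables[name] = ("external", path)

    def drop(self, name):
        kind, path = self.tables.pop(name)
        if kind == "managed":
            del self.storage[path]   # managed drop deletes the data too
        # external drop removes only the metadata entry

cat = Catalog()
cat.save_as_table("dim_customers", ["part-000.parquet"])
cat.storage["Files/ext/sales"] = ["part-001.parquet"]   # files written by .save()
cat.register_external("ext_sales", "Files/ext/sales")

cat.drop("dim_customers")   # data files deleted with the table
cat.drop("ext_sales")       # files survive at their path
print(cat.storage)          # {'Files/ext/sales': ['part-001.parquet']}
```

Rule of thumb: if you want the table's lifecycle to own the files, use `saveAsTable()`; if the files must outlive the table (shared raw data, shortcut targets), keep them external.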