Data Pipelines & Dataflows in Microsoft Fabric

TL;DR

Fabric has two ETL tools: Pipelines (low-code orchestration based on Azure Data Factory — copy data, run notebooks, chain activities) and Dataflows Gen2 (Power Query Online for no-code ETL that lands data in lakehouses). Use pipelines for orchestration and scale, dataflows for citizen-developer self-service.

Explain Like I'm 12

A pipeline is like an assembly line in a factory. First, the robot arm picks up raw materials (Copy Activity), then the paint machine transforms them (Notebook), then the quality checker verifies everything (If Condition), and finally the packing machine delivers the finished product (Write to lakehouse). You set the schedule, and it runs automatically every day. A Dataflow Gen2 is like a simpler version — more like a kitchen blender. You throw ingredients in (connect data), press buttons to chop and mix (Power Query transforms), and pour the result into a bowl (lakehouse table).

[Diagram: Fabric pipeline flow — data sources → Copy Activity → Bronze lakehouse → notebook transform → Gold lakehouse → Power BI reports, with scheduling and monitoring.]

Pipelines Overview

Fabric pipelines are the orchestration engine, based on Azure Data Factory (ADF). If you’ve used ADF before, the pipeline designer will look familiar — same activities, same drag-and-drop UI, but now integrated into Fabric.

Key Activities

| Activity | What It Does | When to Use |
|---|---|---|
| Copy Data | Move data from source to destination | Ingest from 100+ connectors into OneLake |
| Notebook | Run a Spark notebook | Complex transformations with PySpark |
| Dataflow | Run a Dataflow Gen2 | No-code transformations in a pipeline |
| Stored Procedure | Execute a warehouse stored proc | T-SQL-based transformations |
| ForEach | Loop over a collection | Process multiple files or tables |
| If Condition | Branch logic (if/else) | Skip steps based on data conditions |
| Set Variable | Set a pipeline variable | Store values for later activities |
| Web | Call a REST API | Trigger external systems or notifications |
| Wait | Pause for a specified duration | Rate limiting or waiting for an external process |
Parameterize everything: Use pipeline parameters for source paths, table names, and dates. This lets you reuse the same pipeline for multiple datasets by changing parameters at runtime.
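The reuse idea can be sketched in plain Python. The parameter names (`schema`, `table`, `load_date`) are illustrative, not Fabric-defined; in a real pipeline the expression language substitutes parameters into the Copy Activity's source query.

```python
# Sketch: one generic pipeline definition reused for many datasets by
# swapping parameter values at runtime. Names here are hypothetical.

def build_copy_query(params: dict) -> str:
    """Render the Copy Activity source query from pipeline parameters."""
    return (
        f"SELECT * FROM {params['schema']}.{params['table']} "
        f"WHERE modified_date >= '{params['load_date']}'"
    )

# The same "pipeline" handles two different datasets:
sales_run = build_copy_query({"schema": "dbo", "table": "sales", "load_date": "2024-06-01"})
hr_run = build_copy_query({"schema": "hr", "table": "employees", "load_date": "2024-06-01"})
```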

Copy Activity Deep Dive

The Copy Activity is the workhorse for data ingestion. It connects to 100+ sources and writes to Fabric lakehouses, warehouses, or KQL databases.

Common Source → Destination Patterns

| Source | Destination | Pattern |
|---|---|---|
| Azure SQL Database | Lakehouse (Delta table) | Full or incremental load |
| REST API (JSON) | Lakehouse (Files) | Land raw JSON, process with Spark |
| On-prem SQL Server | Lakehouse | Via On-Premises Data Gateway |
| Amazon S3 / ADLS Gen2 | Lakehouse (Files) | Cloud-to-cloud file copy |
| SharePoint / Excel | Lakehouse (Delta table) | Business user data ingestion |
Incremental loads: Use a watermark column (e.g., last_modified_date) to only copy new/changed rows on each run. Store the watermark in a pipeline variable or a control table.
-- Example: Watermark-based incremental pattern
-- Control table tracks the last loaded timestamp
CREATE TABLE dbo.watermark (
    table_name   VARCHAR(100),
    last_loaded  DATETIME2
);

-- Pipeline Copy Activity query (parameterized):
-- SELECT * FROM source_table
-- WHERE modified_date > '@{activity('Lookup').output.firstRow.last_loaded}'
-- AND modified_date <= '@{pipeline().parameters.current_time}'
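The lookup → filter → update cycle can be sketched in plain Python. The dicts stand in for the control table and source system; in a real pipeline these are a Lookup Activity, a Copy Activity, and a Stored Procedure Activity.

```python
from datetime import datetime

# In-memory stand-ins for the watermark control table and the source table.
watermarks = {"sales": datetime(2024, 1, 1)}

source_rows = [
    {"id": 1, "modified_date": datetime(2024, 1, 1)},   # already loaded
    {"id": 2, "modified_date": datetime(2024, 2, 15)},  # new since watermark
    {"id": 3, "modified_date": datetime(2024, 3, 10)},  # new since watermark
]

def incremental_load(table: str, now: datetime) -> list:
    last = watermarks[table]                        # 1. Lookup: read watermark
    new_rows = [r for r in source_rows              # 2. Copy: strictly after the
                if last < r["modified_date"] <= now]  #    watermark, up to "now"
    watermarks[table] = now                         # 3. After success: advance watermark
    return new_rows

loaded = incremental_load("sales", datetime(2024, 4, 1))
```

Note the half-open window (`>` watermark, `<=` now): it prevents both re-loading the boundary row and missing rows that arrive while the copy runs.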

Dataflows Gen2

Dataflows Gen2 bring Power Query Online into Fabric. If you’ve used Power Query in Excel or Power BI Desktop, you already know the interface. The key difference: Dataflows Gen2 write output to Fabric destinations (lakehouses, warehouses) instead of just loading into a Power BI model.

Dataflow Gen2 vs Gen1

| Feature | Dataflow Gen1 | Dataflow Gen2 |
|---|---|---|
| Output destination | Power BI dataset only | Lakehouse, warehouse, KQL database |
| Compute engine | Mashup engine | Enhanced compute (faster for large data) |
| Staging | Optional | OneLake staging (auto, can be disabled) |
| Incremental refresh | Yes | Yes (improved) |
| Use in pipelines | Limited | Yes (as a pipeline activity) |
Best use case for Dataflows Gen2: Self-service ETL by business users who know Power Query but don’t write code. They connect to a SharePoint list, clean data with the visual editor, and land it as a Delta table in a lakehouse — no Spark required.

Orchestration Patterns

Pattern 1: Simple ETL

Copy → Transform → Load

-- Pipeline structure:
-- 1. Copy Activity: Source DB → Bronze lakehouse (raw_sales table)
-- 2. Notebook Activity: Bronze → Silver (clean, deduplicate)
-- 3. Notebook Activity: Silver → Gold (aggregate for reporting)
-- 4. (Optional) Refresh semantic model for Power BI
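The Silver-layer dedupe step (2) can be sketched as follows. Plain Python stands in for PySpark here (in a real notebook this is a `dropDuplicates` or window-rank over the Delta table); field names are illustrative.

```python
# Bronze -> Silver sketch: keep only the latest record per business key.

def bronze_to_silver(rows: list) -> list:
    """Deduplicate by order_id, keeping the most recently modified row."""
    latest = {}
    for r in rows:
        key = r["order_id"]
        if key not in latest or r["modified_date"] > latest[key]["modified_date"]:
            latest[key] = r
    return list(latest.values())

bronze = [
    {"order_id": 1, "amount": 10, "modified_date": "2024-01-01"},
    {"order_id": 1, "amount": 12, "modified_date": "2024-01-02"},  # later correction wins
    {"order_id": 2, "amount": 30, "modified_date": "2024-01-01"},
]
silver = bronze_to_silver(bronze)
```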

Pattern 2: Dynamic Pipeline with ForEach

Process multiple tables dynamically:

-- Pipeline structure:
-- 1. Lookup Activity: Read table list from control table
--    SELECT table_name, source_schema FROM dbo.etl_config WHERE is_active = 1
-- 2. ForEach Activity: Loop over lookup results
--    2a. Copy Activity (parameterized): Copy source_table → lakehouse
--    2b. Notebook Activity (parameterized): Transform table
-- 3. On completion: Send Teams notification via Web Activity
Concurrency limits: ForEach activities have a configurable batch count (default 20, max 50). If you’re processing 100 tables, at most 20 run at a time, with the rest queued for a free slot. Set isSequential: true if order matters.
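The concurrency semantics can be modeled with a bounded thread pool (Fabric manages this for you; this is just a sketch of the behavior):

```python
from concurrent.futures import ThreadPoolExecutor

# ForEach semantics: at most batch_count items run concurrently;
# isSequential forces strict one-at-a-time ordering.

def run_foreach(items, activity, batch_count=20, sequential=False):
    if sequential:  # isSequential: true
        return [activity(item) for item in items]
    with ThreadPoolExecutor(max_workers=batch_count) as pool:
        return list(pool.map(activity, items))  # map preserves item order

tables = [f"table_{n}" for n in range(100)]
results = run_foreach(tables, lambda t: f"copied {t}", batch_count=20)
```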

Pattern 3: Conditional Branching

-- Pipeline structure:
-- 1. Lookup Activity: Check if new data exists
--    SELECT COUNT(*) as row_count FROM staging WHERE processed = 0
-- 2. If Condition: row_count > 0
--    True path: Run Notebook → Update watermark
--    False path: Log "no new data" → Skip processing

Scheduling & Monitoring

Every pipeline can be triggered in three ways:

| Trigger Type | How It Works | Use Case |
|---|---|---|
| Schedule | Cron-like schedule (hourly, daily, custom) | Regular batch loads |
| Manual | Run on demand from the UI or API | Ad-hoc loads, testing |
| Event-based | Trigger on file arrival in OneLake | React to new data landing |

Monitoring

Fabric provides a Monitoring Hub that shows all pipeline runs, notebook runs, dataflow refreshes, and Spark job history across the workspace. Key metrics to watch:

  • Duration — track trends; sudden increases signal problems
  • CU consumption — monitor capacity usage to avoid throttling
  • Rows read/written — verify data volumes match expectations
  • Error details — drill into failed activities for stack traces
Alerts: Set up alerts on pipeline failures via the Fabric admin portal or by adding a Web Activity at the pipeline’s failure path that calls a Teams webhook or sends an email.
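A failure-path Web Activity payload can be sketched like this. The simple `{"text": ...}` body is the minimal message Teams incoming webhooks accept; the webhook URL is a placeholder and the actual POST is left commented out.

```python
import json

def build_failure_alert(pipeline_name: str, error: str) -> bytes:
    """Build the JSON body a Web Activity would POST to a Teams webhook."""
    payload = {"text": f"Pipeline '{pipeline_name}' failed: {error}"}
    return json.dumps(payload).encode("utf-8")

body = build_failure_alert("daily_sales_load", "Copy Activity timeout")

# To actually send (needs a real webhook URL):
# import urllib.request
# req = urllib.request.Request(
#     "https://example.webhook.office.com/...",  # placeholder, not a real endpoint
#     data=body, headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)
```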

Fabric Pipelines vs ADF vs Airflow

| Feature | Fabric Pipelines | Azure Data Factory | Apache Airflow |
|---|---|---|---|
| Interface | Drag-and-drop UI | Drag-and-drop UI | Python code (DAGs) |
| Compute | Shared Fabric capacity (CUs) | Integration Runtime | Workers (Celery, K8s) |
| Native Spark | Fabric notebooks | HDInsight / Databricks | SparkSubmitOperator |
| Native Power BI | Built-in refresh | No | No |
| Cost model | Included in Fabric capacity | Per-activity + DIU hours | Infrastructure cost |
| Event triggers | OneLake file events | Blob storage events | Sensors (S3, GCS, etc.) |
| Open source | No | No | Yes (Apache 2.0) |
| Best for | Microsoft-first teams in Fabric | Azure-native ETL | Cloud-agnostic orchestration |
Migration path: If you have existing ADF pipelines, they can be migrated to Fabric with minimal changes — the activity model is nearly identical. The main difference is that Fabric pipelines use Fabric capacity (CUs) instead of ADF’s separate billing.

Test Yourself

Q: What is the difference between a Fabric pipeline and a Dataflow Gen2?

Pipeline: an orchestration engine (like ADF) that chains activities — copy data, run notebooks, call APIs, branch with if/else, loop with ForEach. Dataflow Gen2: a Power Query Online mashup for no-code/low-code ETL. Pipelines can call Dataflows as an activity. Think of pipelines as the “factory floor” and dataflows as one “machine” on that floor.

Q: How do you implement an incremental load in a Fabric pipeline?

Use a watermark pattern: (1) Store the last-loaded timestamp in a control table. (2) Lookup Activity reads the watermark. (3) Copy Activity queries the source with a WHERE clause filtering rows after the watermark. (4) After success, update the watermark to the current timestamp.

Q: When would you use a notebook activity inside a pipeline instead of a Dataflow Gen2?

Use a notebook when you need: complex PySpark transformations, machine learning, custom Python libraries, processing semi-structured data (nested JSON, XML), or joining data from multiple lakehouses. Use Dataflow Gen2 when the transformation is simple enough for Power Query’s visual interface and the user doesn’t write code.

Q: What is the maximum concurrency for a ForEach activity in Fabric pipelines?

50 (configurable via the batchCount property). Default is 20. If you need sequential execution, set isSequential: true. Items beyond the batch count wait for a slot to open.

Q: Name three ways to trigger a Fabric pipeline.

(1) Schedule trigger — cron-like (hourly, daily, custom). (2) Manual trigger — run on demand from the UI or REST API. (3) Event-based trigger — fires when a file arrives in OneLake.

Interview Questions

Q: Your company migrates from Azure Data Factory to Fabric pipelines. What changes and what stays the same?

Same: Activity model (Copy, ForEach, If, etc.), visual designer, parameterization, linked services concept (now “connections”). Changes: Billing moves from per-activity + DIU hours to shared Fabric capacity (CUs). Integration Runtimes are replaced by Fabric’s built-in compute. Destinations shift from ADLS/Synapse to OneLake lakehouses/warehouses. Monitoring moves to Fabric’s Monitoring Hub.

Q: Design a pipeline that ingests data from 50 source tables into a bronze lakehouse, with error handling and notifications.

Pipeline: (1) Lookup Activity reads active tables from a control table. (2) ForEach Activity (batch=20) loops over the list. Inside ForEach: (a) Copy Activity (parameterized with table name) copies source → bronze lakehouse. (b) On success: Stored Procedure updates watermark in control table. (c) On failure: Set Variable appends error to a failures list. (3) After ForEach: If Condition checks if any failures occurred. If yes: Web Activity sends a Teams webhook with the error summary.
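The error-handling flow in that design can be sketched as follows: collect per-table failures inside the loop, then decide after the loop whether to notify. Table names and the failure case are illustrative.

```python
# ForEach error handling sketch: failures accumulate in a variable,
# and an If Condition after the loop decides whether to alert.

def copy_table(table: str) -> None:
    """Stand-in for a parameterized Copy Activity; one table fails."""
    if table == "orders":
        raise RuntimeError("source timeout")

failures = []
for table in ["customers", "orders", "products"]:
    try:
        copy_table(table)                 # Copy Activity (per-table)
    except RuntimeError as exc:           # failure path: Set Variable appends
        failures.append(f"{table}: {exc}")

send_alert = len(failures) > 0            # If Condition after the ForEach
summary = "; ".join(failures)             # body for the Teams webhook
```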

Q: A Dataflow Gen2 is running too slowly on large datasets. How would you troubleshoot and optimize?

(1) Check if staging is enabled in the dataflow (OneLake staging uses Spark for faster processing). (2) Reduce the number of columns selected early (column pruning). (3) Filter rows as early as possible (push-down to source via query folding). (4) Avoid complex M functions that break query folding. (5) For very large data, consider switching to a Spark notebook instead — Dataflows are designed for small-to-medium datasets.

Q: How would you handle a scenario where a pipeline needs to process data from both a real-time stream and a daily batch source?

Use a lambda architecture in Fabric: (1) Real-time path: Eventstream ingests streaming data into a KQL database for real-time dashboards. (2) Batch path: A scheduled pipeline runs daily, copies data into the bronze lakehouse, transforms through Silver → Gold. (3) Both paths write to OneLake. Power BI reports use Direct Lake on the Gold lakehouse for batch and a KQL visual for real-time. Alternatively, use Spark Structured Streaming in a notebook for near-real-time processing into Delta tables.