ML Pipelines & Feature Stores

TL;DR

ML pipelines automate the training workflow: data ingestion → validation → feature engineering → training → evaluation → registration. Feature stores centralize feature computation so training and serving use the same code, eliminating training/serving skew. Together they make ML systems reproducible and scalable.

The Big Picture

ML Pipeline architecture showing data ingestion, validation, feature engineering, training, evaluation, and model registration stages connected to a feature store
Explain Like I'm 12

Think of a car factory assembly line. Raw materials come in one end, and finished cars roll out the other. Each station does one job: stamp the metal, attach the engine, paint the body, test the brakes. If any station fails quality checks, the car doesn't ship.

An ML pipeline is the same thing, but for models. Raw data comes in, gets cleaned, gets turned into features, feeds a training algorithm, and the finished model rolls out — tested and ready.

The feature store is like a parts warehouse. Instead of every assembly line making its own screws, there's one warehouse where all lines grab the same standardized screws. That way every car has identical parts.

ML Pipeline Architecture

A production ML pipeline has these stages, each implemented as an independent, rerunnable step:

1. Data Ingestion

Pull data from sources: databases, APIs, data lakes, streaming platforms. Validate schema and freshness. If the data didn't arrive or has the wrong format, stop here — don't train on garbage.

2. Data Validation

Check data quality: missing values, unexpected distributions, schema changes, range violations. Tools like Great Expectations or TensorFlow Data Validation (TFDV) automate these checks.

import great_expectations as gx

context = gx.get_context()
batch = context.get_batch("training_data")

# Define expectations
batch.expect_column_values_to_not_be_null("user_id")
batch.expect_column_values_to_be_between("age", min_value=0, max_value=120)
batch.expect_column_values_to_be_in_set("country", ["US", "UK", "CA", "DE", "FR"])
batch.expect_table_row_count_to_be_between(min_value=10000)

# Validate - fails fast if any expectation breaks
results = batch.validate()
if not results.success:
    raise ValueError(f"Data validation failed: {results}")
Warning: Skipping data validation is the #1 cause of silent model failures. A single upstream schema change (column renamed, type changed) can make your model output random garbage — and it'll still return 200 OK to every request.

3. Feature Engineering

Transform raw data into model-ready features. This might mean aggregations (average transaction amount over 30 days), encodings (one-hot, label, target), or derived features (ratios, time-since-event). Feature stores (see below) make this step reusable.

4. Training

Train the model on processed features. Log all parameters, metrics, and the model artifact to your experiment tracking system. This step is often the most resource-intensive (GPU clusters for deep learning).

5. Evaluation & Validation

Test the trained model against a held-out set. Run automated validation gates: does it beat the current champion? Does it meet minimum metric thresholds? Does inference latency fit within the SLA?

6. Model Registration

If the model passes all gates, register it in the model registry. Tag it with metadata (training data version, metrics, commit hash). It's now ready for deployment.

Pipeline Tool Comparison

Tool Type Best For Trade-off
Kubeflow Pipelines Kubernetes-native Teams already on K8s; need GPU scheduling Complex setup; requires K8s expertise
Apache Airflow General orchestrator Teams already using Airflow for data pipelines Not ML-specific; no built-in experiment tracking
SageMaker Pipelines Managed (AWS) AWS shops wanting fully managed ML Vendor lock-in; expensive at scale
Vertex AI Pipelines Managed (GCP) GCP shops; uses Kubeflow SDK GCP-only; Kubeflow SDK learning curve
ZenML Framework-agnostic Teams wanting infrastructure portability Newer ecosystem; smaller community
Prefect / Dagster Modern orchestrators Pythonic API; good data awareness Less ML-specific tooling than Kubeflow
Tip: Don't choose a pipeline tool based on features alone. Choose based on what your team already knows. If you're running Airflow for data engineering, extend it for ML. If you're on Kubernetes, Kubeflow is natural. The best tool is the one your team will actually maintain.

Feature Stores Deep Dive

The Training/Serving Skew Problem

This is the core problem feature stores solve. During training, you compute features from a historical dataset (batch). During serving, you compute the same features in real-time from a live request. If the code is different — and it almost always is, because batch SQL and real-time Python are different languages — your model sees different inputs in production than it was trained on.

The result: degraded accuracy that's incredibly hard to debug because the model itself hasn't changed.

Feature Store Architecture

A feature store has two main components:

  • Offline store — Optimized for batch reads. Stores historical feature values for training. Backed by a data warehouse or data lake (BigQuery, S3/Parquet, Snowflake).
  • Online store — Optimized for low-latency point lookups. Stores the latest feature values for real-time serving. Backed by Redis, DynamoDB, or similar key-value stores.

Feature definitions are written once. The store materializes them to both offline and online stores, guaranteeing consistency.

from feast import Entity, Feature, FeatureView, FileSource
from feast.types import Float64, Int64

# Define a feature view
driver_stats = FeatureView(
    name="driver_stats",
    entities=[Entity(name="driver_id", join_keys=["driver_id"])],
    schema=[
        Feature(name="avg_daily_trips", dtype=Int64),
        Feature(name="avg_rating", dtype=Float64),
        Feature(name="lifetime_trip_count", dtype=Int64),
    ],
    source=FileSource(
        path="data/driver_stats.parquet",
        timestamp_field="event_timestamp",
    ),
    ttl=timedelta(days=1),  # Feature freshness
)

# At training time: get historical features
training_df = store.get_historical_features(
    entity_df=entity_df,  # driver_ids + timestamps
    features=["driver_stats:avg_daily_trips", "driver_stats:avg_rating"],
).to_df()

# At serving time: get latest features (low-latency)
online_features = store.get_online_features(
    features=["driver_stats:avg_daily_trips", "driver_stats:avg_rating"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

Point-in-Time Correctness

A subtle but critical feature: when building training data, the feature store only returns feature values that were available at the time of the event. This prevents data leakage — accidentally using future information to predict the past. Without this, your model looks great in offline evaluation but performs terribly in production.

Feature Store vs. Data Warehouse: A data warehouse stores raw and aggregated data for analytics. A feature store stores computed, versioned features optimized for ML. Think of it as a specialized layer on top of your data warehouse, focused on serving features to models with guaranteed consistency between training and serving.

CI/CD for ML Pipelines

Just like application code, ML pipelines need CI/CD. But the scope is wider — you're testing code, data, and model quality:

Stage What's Tested Example
Code Tests Feature engineering logic, data transforms Unit tests for feature functions
Data Tests Schema, distributions, freshness Great Expectations validation suite
Model Tests Metrics, latency, bias F1 > 0.85, p99 latency < 100ms
Integration Tests End-to-end pipeline execution Run full pipeline on a sample dataset
# .github/workflows/ml-pipeline-ci.yml
name: ML Pipeline CI
on:
  push:
    paths:
      - 'ml/**'
      - 'features/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/unit/ -v
      - name: Run data validation
        run: python scripts/validate_data.py --sample
      - name: Run pipeline on sample
        run: python pipelines/train.py --sample-size 1000
      - name: Check model metrics
        run: python scripts/check_metrics.py --min-f1 0.80
Tip: Run pipeline CI on a sample of the data. Full training runs can take hours and cost hundreds of dollars in GPU time. A 1% sample catches most code and data bugs in minutes. Reserve full runs for nightly builds or pre-deployment validation.

Test Yourself

What is training/serving skew and how does a feature store prevent it?

Training/serving skew occurs when features are computed differently at training time (batch SQL) vs serving time (real-time Python), causing the model to see different inputs in production. A feature store prevents this by defining features once and materializing them to both an offline store (for training) and an online store (for serving), guaranteeing the same computation logic is used in both cases.

Why is data validation important in an ML pipeline?

Data validation catches upstream data quality issues before they corrupt your model. Without it, a schema change (column renamed), missing data, or distribution shift can cause the model to silently produce bad predictions while still returning 200 OK. Validation acts as a circuit breaker: if the data doesn't meet expectations, the pipeline stops instead of training on garbage.

Compare Kubeflow Pipelines vs Apache Airflow for ML workflows.

Kubeflow Pipelines is Kubernetes-native, built specifically for ML — it has native GPU scheduling, experiment tracking integration, and ML-specific pipeline components. But it requires Kubernetes expertise. Apache Airflow is a general-purpose orchestrator that many teams already use for data pipelines. It's not ML-specific (no built-in experiment tracking) but it can orchestrate ML training jobs. Choose Kubeflow if you need K8s-level ML features. Choose Airflow if your team already runs it and you want to avoid adding another system.

What is point-in-time correctness in a feature store?

Point-in-time correctness ensures that when building training data, the feature store only returns feature values that were available at the time of the prediction event. This prevents data leakage — accidentally using future information. For example, if you're predicting fraud for a transaction at 2pm, the feature store won't use any data generated after 2pm. Without this, your model appears more accurate offline than it actually is in production.

How is CI/CD for ML different from CI/CD for traditional software?

Traditional CI/CD tests code. ML CI/CD tests three things: code (unit tests for feature engineering), data (schema validation, distribution checks, freshness), and model quality (metrics thresholds, latency SLAs, bias checks). ML CI is also more expensive — running the full pipeline on all data can take hours and cost GPU time, so teams often test on data samples first and reserve full runs for nightly or pre-deployment gates.

Interview Questions

Q: Your team has 10 models all computing the same "user_avg_transactions_30d" feature independently. What would you do?

Introduce a feature store. Define "user_avg_transactions_30d" once in the store, compute it centrally, and serve it to all 10 models from the same source. This eliminates duplication, ensures consistency (all models use the same computation), prevents training/serving skew, and makes it easy to update the feature logic in one place.

Q: How would you design a retraining pipeline that runs when data drift is detected?

1. Set up a monitoring job (e.g., Evidently) that runs on a schedule (hourly/daily) and compares production data distributions against the training data reference. 2. When drift exceeds a threshold (e.g., KS statistic > 0.1 on key features), the monitor publishes an event to a message queue or triggers an API call. 3. The event triggers the retraining pipeline (Kubeflow/Airflow) which pulls fresh data, validates it, trains a new model, and evaluates it against the champion. 4. If the new model passes validation gates, it's registered in the model registry and deployed via canary rollout. 5. A human-in-the-loop approval step is wise for the first few automated cycles.

Q: What's the difference between an offline and online feature store?

The offline store is optimized for batch reads — it stores historical feature values for training data generation. Backed by data warehouses or data lakes (S3, BigQuery). Latency: seconds to minutes. The online store is optimized for low-latency point lookups — it stores the latest feature values for real-time serving. Backed by key-value stores (Redis, DynamoDB). Latency: single-digit milliseconds. Both stores are populated from the same feature definitions, ensuring consistency.