Core Concepts of MLOps

TL;DR

MLOps has 7 building blocks: Experiment Tracking (log every training run), Data & Model Versioning (Git for datasets and models), ML Pipelines (automate the training workflow), Model Registry (promote models through stages), Model Serving (deploy as an API), Monitoring (watch for drift and degradation), and Feature Stores (reuse features across models). Master these and you can productionize any ML system.

The Big Picture

These seven concepts form the MLOps stack. Data flows up through feature engineering and training, models are versioned and registered, then deployed and monitored — with feedback loops driving retraining:

MLOps concept map showing the 7 building blocks: Experiment Tracking, Versioning, Pipelines, Registry, Serving, Monitoring, and Feature Stores
Explain Like I'm 12

Imagine you're baking cookies and trying to find the perfect recipe.

Experiment Tracking is your lab notebook. Every batch you bake, you write down the exact ingredients and oven temperature, then rate how they taste. Without the notebook, you'd forget which batch was the best.

Versioning is like numbering your recipe drafts: Recipe v1 (too salty), Recipe v2 (too sweet), Recipe v3 (perfect!). You keep all versions so you can go back.

Pipelines are your kitchen assembly line. Instead of doing everything by hand, you set up robots: one mixes ingredients, one shapes the dough, one runs the oven, one taste-tests. Push a button and cookies come out.

Model Registry is the display case. Your best cookies go from "testing" to "approved" to "on sale." Customers only eat the approved ones.

Serving is the bakery counter. Customers walk up and order cookies. You need to serve them fast without the counter breaking.

Monitoring is checking customer reviews. If people start complaining the cookies taste different, you investigate and rebake.

Feature Store is your pre-prepped ingredients shelf. Instead of chopping chocolate chips fresh for every recipe, you keep a shelf of pre-chopped ingredients that any recipe can grab.

Cheat Sheet

Concept Tool Example Plain English
Experiment Tracking MLflow, W&B Log every training run's hyperparameters, metrics, and artifacts
Versioning DVC, LakeFS Version datasets and models like you version code with Git
ML Pipelines Kubeflow, Airflow Chain data prep → training → evaluation into automated workflows
Model Registry MLflow Registry, SageMaker Store model versions and promote them through staging → production
Model Serving Seldon, BentoML, TorchServe Deploy models as REST/gRPC APIs for real-time or batch predictions
Monitoring Evidently, Whylabs, Arize Detect data drift, prediction drift, and performance degradation
Feature Store Feast, Tecton, Databricks FS Centralize features so teams share and reuse them across models

The 7 Building Blocks

1. Experiment Tracking

Data scientists don't write code once and ship it. They run hundreds of experiments — trying different algorithms, tuning hyperparameters, adding or removing features. Without tracking, you end up with a notebook called model_final_v3_ACTUALLY_FINAL.ipynb and no idea which parameters produced the best model.

Experiment tracking tools log every run's parameters (learning rate, batch size, epochs), metrics (accuracy, F1, loss), and artifacts (the trained model file, plots, confusion matrices). You can compare runs side by side and reproduce any result.

import mlflow

mlflow.set_experiment("fraud-detection")

with mlflow.start_run():
    # Log hyperparameters
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 6)

    # Train model
    model = train_model(X_train, y_train, lr=0.001, n_estimators=100)

    # Log metrics
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.89)
    mlflow.log_metric("auc_roc", 0.97)

    # Log the model artifact
    mlflow.sklearn.log_model(model, "model")
Tip: Track experiments from day one. Retrofitting tracking onto an existing project is painful. Even if it's just you on the project, your future self will thank you when someone asks "how did you train that model in production?"

2. Data & Model Versioning

Git versions your code, but datasets and model files are too large for Git (think gigabytes of CSVs, images, or model weights). DVC (Data Version Control) solves this by storing a lightweight pointer in Git while the actual data lives in cloud storage (S3, GCS, Azure Blob).

This gives you the full Git experience for data: branches, diffs, tags, rollbacks. When someone asks "what data did we train model v2.3 on?" you can answer in one command.

# Initialize DVC in your Git repo
dvc init

# Track a large dataset
dvc add data/training_set.parquet

# Git tracks the .dvc pointer file
git add data/training_set.parquet.dvc .gitignore
git commit -m "Add training dataset v1"

# Push data to remote storage
dvc push

# Later: switch to a different data version
git checkout v2.0
dvc checkout
Model versioning is often handled by the Model Registry (see below) rather than DVC. DVC excels at dataset versioning. The model registry excels at model lifecycle management (staging → production).

3. ML Pipelines

An ML pipeline chains together the steps of your ML workflow: data ingestion → data validation → feature engineering → training → evaluation → registration. Each step is a self-contained unit with defined inputs and outputs.

Why not just run a script? Because pipelines give you reproducibility (every run is identical), caching (skip steps that haven't changed), parallelism (run independent steps simultaneously), and scheduling (trigger retraining daily or on data arrival).

Tools range from code-first (Kubeflow Pipelines, ZenML) to orchestrators you already know (Airflow, Prefect) to managed services (SageMaker Pipelines, Vertex AI Pipelines).

Tip: If your team already uses Apache Airflow for data pipelines, you can use it for ML pipelines too. Don't adopt a new orchestrator just because it has "ML" in the name.

Deep dive: ML Pipelines & Feature Stores

4. Model Registry

A model registry is a central catalog of trained models. Think of it like a Docker registry, but for ML models. Each model has versions, and each version moves through stages: NoneStagingProductionArchived.

The registry stores the model artifact, its lineage (which experiment and data produced it), and metadata (metrics, tags, description). When you deploy to production, you pull from the registry — never from someone's laptop.

import mlflow

# Register a model from an experiment run
result = mlflow.register_model(
    model_uri="runs:/abc123/model",
    name="fraud-detector"
)

# Promote to production
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud-detector",
    version=3,
    stage="Production"
)
Warning: Never deploy a model that hasn't been registered. If you deploy from a notebook or local file, you lose traceability. When the model misbehaves in production, you won't know which code, data, or parameters produced it.

5. Model Serving

Model serving is how predictions reach end users. There are two main patterns:

  • Real-time (online) serving — Model runs behind a REST or gRPC API. User sends a request, gets a prediction back in milliseconds. Example: fraud detection on a payment, content recommendations on page load.
  • Batch serving — Model processes a large dataset on a schedule (daily, hourly). Results are written to a table or file. Example: scoring all customers for churn risk overnight.

For real-time serving, you need to think about latency (how fast), throughput (how many concurrent requests), autoscaling (handle traffic spikes), and A/B testing (route a percentage of traffic to a new model version).

Edge serving is a third pattern where the model runs on the device itself (phone, IoT sensor, browser). This eliminates network latency but limits model size and update frequency.

6. Monitoring

Traditional monitoring watches CPU, memory, and error rates. ML monitoring adds a layer: is the model still making good predictions?

Three things to monitor:

  • Data drift — The input data distribution has changed from what the model was trained on. Example: a new customer segment starts using your product.
  • Prediction drift — The model's output distribution is changing, even if the input looks similar. The model may be "confused" by subtle shifts.
  • Performance degradation — When you have ground truth labels (often delayed), you can directly measure accuracy, precision, recall. If they drop below a threshold, trigger retraining.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Compare training data vs recent production data
report = Report(metrics=[DataDriftPreset()])
report.run(
    reference_data=training_df,
    current_data=production_df
)

# Save or display the drift report
report.save_html("drift_report.html")
Warning: A model can look healthy on traditional metrics (low latency, no errors) while silently making terrible predictions. Always monitor ML-specific metrics alongside infrastructure metrics.

Deep dive: ML Model Lifecycle

7. Feature Stores

Features are the processed inputs your model needs. A "user_avg_transaction_30d" feature takes raw transaction data and computes a 30-day rolling average. This feature might be useful for fraud detection, credit scoring, and churn prediction.

Without a feature store, each team writes its own feature computation code, leading to inconsistency (training uses one calculation, serving uses another) and duplication (five teams compute the same feature five different ways).

A feature store provides:

  • Single source of truth — One definition per feature, used by both training and serving
  • Online + offline serving — Low-latency reads for real-time predictions, batch reads for training
  • Point-in-time correctness — Prevents data leakage by only using features available at prediction time
  • Feature discovery — A catalog so teams can search and reuse existing features
Tip: You don't need a feature store for your first model. Start with features computed in your training script. Adopt a feature store when you have multiple models sharing features, or when training/serving skew becomes a problem.

Deep dive: ML Pipelines & Feature Stores

MLOps Maturity Levels

Google's MLOps maturity model defines three levels. Most teams start at Level 0 and work their way up:

Level Name What it looks like
0 Manual Notebooks, manual training, manual deployment, no monitoring. "It works on my laptop."
1 ML Pipeline Automation Automated training pipelines, experiment tracking, model registry. Training is reproducible. Deployment may still be manual.
2 CI/CD for ML Full CI/CD: automated testing of data, models, and code. Automated deployment with canary rollouts. Monitoring triggers retraining. The entire loop is automated.

Test Yourself

What is the difference between experiment tracking and a model registry?

Experiment tracking logs every training run — parameters, metrics, and artifacts. It answers "what did I try and how did it perform?" A model registry stores approved model versions and manages their lifecycle (staging → production). It answers "which model version is currently in production?" They're complementary: experiments produce candidates, the registry manages winners.

Why can't you just use Git for dataset versioning?

Datasets are often gigabytes or terabytes — far too large for Git, which stores every version of every file in the repo history. Tools like DVC solve this by storing a small pointer file in Git while the actual data lives in cloud storage (S3, GCS). This gives you Git-like versioning (branches, tags, diffs) without bloating your repository.

What are the three types of drift you should monitor in production ML?

(1) Data drift — input data distribution shifts from the training distribution. (2) Prediction drift — the model's output distribution changes. (3) Performance degradation — accuracy/precision/recall drops when measured against ground truth labels. All three indicate the model may need retraining.

What problem does a feature store solve?

Feature stores solve training/serving skew (features computed differently at training time vs serving time) and feature duplication (multiple teams computing the same feature independently). They provide a single source of truth for feature definitions, with both online (low-latency) and offline (batch) access patterns.

Describe Google's three MLOps maturity levels.

Level 0 (Manual): Everything is manual — notebooks, hand-deployed models, no monitoring. Level 1 (ML Pipeline Automation): Automated training pipelines, experiment tracking, and model registry. Training is reproducible but deployment may be manual. Level 2 (CI/CD for ML): Full automation including CI/CD for data, model, and code. Automated deployment with monitoring that triggers retraining. The entire loop runs without human intervention.