ML Model Lifecycle

TL;DR

An ML model's journey: train (experiments, hyperparameter tuning), evaluate (offline metrics, validation sets), register (version in model registry), deploy (shadow mode, canary, blue-green), monitor (data drift, prediction drift, latency), and retrain (triggered by drift or on a schedule). The lifecycle never ends — it loops.

The Big Picture

[Diagram: Training → Evaluation → Registry → Deployment → Monitoring → Retraining, in a continuous loop]

Explain Like I'm 12

Think of a weather forecasting robot. You train it by showing it years of weather data. Then you test it: did it correctly predict rain on days it actually rained? If it scores well, you publish the forecast to the newspaper. Every day you check: is it still getting the forecasts right? If it starts predicting snow in July, you retrain it with newer data. Train → test → publish → watch → retrain. Over and over.

Training

Training is where you teach the model to find patterns in data. In production MLOps, training is not a one-off notebook — it's an automated, reproducible pipeline.

Hyperparameter Tuning

Models have knobs you set before training: learning rate, number of layers, batch size, regularization strength. Finding the best combination is hyperparameter tuning.

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Grid Search | Try every combination from a predefined grid | Small search space, few parameters |
| Random Search | Sample random combinations | Large search space, faster than grid |
| Bayesian Optimization | Use previous results to guide the next trial | Expensive training runs (deep learning) |

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 5)
    dropout = trial.suggest_float("dropout", 0.1, 0.5)

    model = build_model(lr=lr, n_layers=n_layers, dropout=dropout)
    accuracy = train_and_evaluate(model, X_train, y_train, X_val, y_val)
    return accuracy  # Optuna maximizes this return value

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print(f"Best params: {study.best_params}")
print(f"Best accuracy: {study.best_value:.4f}")
```
Tip: Always use a validation set separate from your test set for hyperparameter tuning. If you tune on the test set, you're overfitting to it. The test set should only be used once — for the final evaluation before deployment.
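To make that separation concrete, here is a minimal sketch of a three-way split using scikit-learn. The 60/20/20 ratio and the toy data are illustrative:

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with binary labels.
X = [[float(i)] for i in range(100)]
y = [i % 2 for i in range(100)]

# Carve off the test set first; it is touched exactly once, at the end.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
# Split the remainder into train and validation (0.25 * 0.80 = 0.20 overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Hyperparameter tuning (like the Optuna study above) reads only the validation split; the test split stays untouched until the final evaluation.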

Evaluation

Before a model goes to production, it must pass offline evaluation. This means measuring it against a held-out test set that the model has never seen.

Key metrics depend on the task:

| Task | Metrics | Watch out for |
| --- | --- | --- |
| Classification | Accuracy, Precision, Recall, F1, AUC-ROC | Class imbalance — 99% accuracy is meaningless if 99% of data is one class |
| Regression | MAE, RMSE, R² | Outliers can skew RMSE — check MAE too |
| Ranking | NDCG, MAP, MRR | Position bias — users click top results regardless of quality |

Warning: Offline metrics are necessary but not sufficient. A model that looks great on a test set may fail in production due to data drift, feature engineering bugs, or latency issues. Always validate with online evaluation (A/B tests, shadow mode) before full rollout.
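The class-imbalance trap from the table is easy to demonstrate with scikit-learn on hypothetical labels:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical imbalanced dataset: 95 negatives, 5 positives (e.g. fraud).
y_true = [0] * 95 + [1] * 5

# A useless model that always predicts the majority class:
y_majority = [0] * 100
print(accuracy_score(y_true, y_majority))                 # 0.95 -- looks great
print(recall_score(y_true, y_majority, zero_division=0))  # 0.0  -- catches nothing

# A real model: 4 true positives, 2 false positives, 1 miss.
y_pred = [0] * 93 + [1, 1] + [1, 1, 1, 1, 0]
print(round(precision_score(y_true, y_pred), 3))  # 0.667
print(round(recall_score(y_true, y_pred), 3))     # 0.8
print(round(f1_score(y_true, y_pred), 3))         # 0.727
```

The majority-class model wins on accuracy yet catches zero positives, which is why precision, recall, and F1 belong in the evaluation report alongside accuracy.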

Model Validation Gates

Automated validation gates prevent bad models from reaching production. Your pipeline should check:

  • Metrics exceed minimum thresholds (e.g., F1 > 0.85)
  • New model outperforms the current production model (champion/challenger)
  • No data leakage detected
  • Inference latency is within SLA (e.g., p99 < 100ms)
  • Bias/fairness checks pass for protected attributes
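A validation gate can be as simple as a function that runs each check and blocks promotion on any failure. A minimal sketch, with illustrative metric names and thresholds:

```python
def passes_validation_gates(challenger, champion, min_f1=0.85, sla_p99_ms=100.0):
    """Run each gate; any failure blocks promotion. Names/thresholds are illustrative."""
    checks = {
        "min_f1": challenger["f1"] > min_f1,
        "beats_champion": challenger["f1"] > champion["f1"],
        "latency_sla": challenger["p99_latency_ms"] < sla_p99_ms,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return len(failed) == 0, failed

ok, failed = passes_validation_gates(
    challenger={"f1": 0.91, "p99_latency_ms": 42.0},
    champion={"f1": 0.88},
)
print(ok, failed)  # True []
```

In practice this would run as an automated pipeline step, with the failed-check names logged to the model registry so a rejected challenger is auditable.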

Deployment Strategies

Deploying a model is riskier than deploying code because a bad model produces wrong answers, not error pages. Use progressive rollout strategies:

Shadow Mode

The new model runs alongside the current one but its predictions are logged, not served to users. You compare both models' predictions on real traffic without any risk. This is the safest first step for high-stakes models (financial, medical).
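A shadow deployment can be sketched as a request handler that serves the champion and only logs the shadow model's output. The model callables and the log format here are hypothetical:

```python
import json
import logging

def handle_request(features, champion, shadow):
    """Serve the champion's prediction; log the shadow's for offline comparison."""
    served = champion(features)
    try:
        # The shadow prediction is logged, never returned to the user.
        shadowed = shadow(features)
        logging.info(json.dumps({"served": served, "shadow": shadowed,
                                 "features": features}))
    except Exception:
        # A failing shadow model must never break user-facing serving.
        logging.exception("shadow model failed")
    return served

result = handle_request({"amount": 120.0}, champion=lambda f: 0, shadow=lambda f: 1)
print(result)  # 0 -- only the champion's answer is served
```

The key invariant: the shadow path is wrapped so that its errors and latency cannot affect the response the user receives.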

Canary Deployment

Route a small percentage of traffic (e.g., 5%) to the new model. Monitor metrics. If healthy, gradually increase (10%, 25%, 50%, 100%). If metrics degrade, roll back immediately — only 5% of users were affected.
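Canary routing is often implemented by hashing a stable identifier, so each user consistently sees the same model while the overall split stays close to the target percentage. A sketch (the user-ID scheme is hypothetical):

```python
import hashlib

def route_to_canary(user_id: str, canary_pct: float) -> bool:
    """Deterministic bucketing: a given user always lands on the same side."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < canary_pct / 100.0

# Roughly 5% of users hit the canary, and routing is stable per user.
share = sum(route_to_canary(f"user-{i}", 5) for i in range(10_000)) / 10_000
print(round(share * 100, 1))
```

Ramping to 10%, 25%, 50%, and 100% is then just a config change to `canary_pct`, and rollback is setting it to 0.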

Blue-Green Deployment

Run two identical environments. Blue serves the current model. Green gets the new model. Switch traffic all at once after testing. Instant rollback by switching back to Blue.

A/B Testing

Split traffic between the old model (control) and new model (treatment). Measure a business metric (revenue, conversion, engagement), not just ML metrics. Run until statistically significant. This is the gold standard for validating that a better model actually improves the product.
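For a conversion-style metric, significance can be checked with a two-proportion z-test. A self-contained sketch with illustrative numbers (a real experiment would also fix the sample size in advance to avoid peeking):

```python
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (large-sample approx.)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal tail.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Illustrative: 4.8% vs 5.4% conversion over 10k users per arm.
z, p = two_proportion_z(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z={z:.2f}, p={p:.3f}")  # p > 0.05 here -- keep the test running
```

Note that a 12% relative lift on 10,000 users per arm still doesn't clear the 0.05 bar in this example, which is exactly why "run until statistically significant" matters.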

Deployment progression: Most teams follow this order: Shadow Mode (validate safely) → Canary (limit blast radius) → A/B Test (measure business impact) → Full Rollout. Skip stages only when the risk is low.

Monitoring in Production

Models don't crash — they silently degrade. That makes monitoring critical. Here's what to watch:

Data Drift Detection

Compare the distribution of incoming features against the training data. Statistical tests like Kolmogorov-Smirnov (for numerical features) and chi-squared (for categorical features) can detect when distributions diverge.

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnDrift

# Test for drift on key features
suite = TestSuite(tests=[
    TestColumnDrift(column_name="transaction_amount"),
    TestColumnDrift(column_name="user_age"),
    TestColumnDrift(column_name="device_type"),
])
suite.run(reference_data=train_df, current_data=prod_df)

# Check if any test failed
if not suite.as_dict()["summary"]["all_passed"]:
    trigger_retraining_pipeline()
```
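Under the hood, column-drift tests like these reduce to the statistical tests mentioned above. A minimal sketch using scipy directly, with synthetic data standing in for training and production features:

```python
import numpy as np
from scipy.stats import chi2_contingency, ks_2samp

rng = np.random.default_rng(0)

# Numerical feature: training distribution vs. a shifted production distribution.
train_amounts = rng.normal(loc=50, scale=10, size=5_000)
prod_amounts = rng.normal(loc=60, scale=10, size=5_000)  # mean drifted upward
ks_stat, ks_p = ks_2samp(train_amounts, prod_amounts)
print(f"KS p-value: {ks_p:.2e}")  # tiny -> distributions differ

# Categorical feature: compare category counts with a chi-squared test.
train_counts = [700, 250, 50]   # e.g. device_type: desktop / mobile / tablet
prod_counts = [400, 500, 100]
chi2, chi_p, _, _ = chi2_contingency([train_counts, prod_counts])
print(f"chi-squared p-value: {chi_p:.2e}")  # tiny -> category mix shifted
```

With large production volumes, even trivial shifts produce tiny p-values, so teams typically alert on effect size (e.g. the KS statistic) as well as significance.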

Performance Monitoring

When ground truth arrives (which can be delayed by hours, days, or weeks), compute actual accuracy. Set up dashboards tracking:

  • Accuracy / F1 / AUC over time (rolling windows)
  • Prediction distribution (are predictions shifting?)
  • Feature importance stability
  • Latency (p50, p95, p99)
  • Throughput (requests per second)
Warning: Ground truth delay is a real problem. For fraud detection, you might not know if a transaction was actually fraudulent for 30-90 days. In the meantime, use proxy metrics and data drift detection as early warning signals.
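One widely used label-free proxy is the Population Stability Index (PSI), computed over the model's score distribution. A minimal sketch with synthetic scores; the bin count and the usual < 0.1 / 0.1-0.25 / > 0.25 interpretation are conventions, not hard rules:

```python
import numpy as np

def psi(reference, current, n_bins=10):
    """Population Stability Index between two score distributions."""
    edges = np.histogram_bin_edges(reference, bins=n_bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) on empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(1)
baseline = rng.beta(2, 5, size=10_000)  # score distribution at deployment time
drifted = rng.beta(3, 3, size=10_000)   # scores shifting upward later

print(round(psi(baseline, baseline[:5_000]), 3))  # small -> stable
print(round(psi(baseline, drifted), 3))           # large -> investigate
```

Because PSI needs only the model's own outputs, it works during the window when fraud labels haven't arrived yet.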

Retraining Strategies

When should you retrain? There are three triggers:

  • Scheduled retraining — Retrain on a fixed cadence (daily, weekly, monthly). Simple and predictable. Use when data arrives regularly.
  • Drift-triggered retraining — Retrain when monitoring detects significant data or prediction drift. More efficient — you only retrain when needed.
  • Performance-triggered retraining — Retrain when actual metrics drop below a threshold. Most accurate trigger, but requires ground truth labels.
Tip: Start with scheduled retraining (weekly or monthly). It's simple and catches most drift. Add drift-triggered retraining later when you need faster response. Always keep a human in the loop for the first few retraining cycles to validate the new model before promoting to production.
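The three triggers can be combined into a single decision function; the thresholds and cadence below are illustrative:

```python
from datetime import datetime, timedelta

def should_retrain(last_trained, now, drift_score, live_f1=None,
                   max_age=timedelta(days=7), drift_threshold=0.25, f1_floor=0.85):
    """Combine scheduled, drift-triggered, and performance-triggered retraining."""
    if now - last_trained > max_age:
        return True, "scheduled"
    if drift_score > drift_threshold:
        return True, "drift"
    if live_f1 is not None and live_f1 < f1_floor:  # needs ground truth labels
        return True, "performance"
    return False, "none"

now = datetime(2024, 6, 10)
print(should_retrain(datetime(2024, 6, 1), now, drift_score=0.05))  # (True, 'scheduled')
print(should_retrain(datetime(2024, 6, 9), now, drift_score=0.40))  # (True, 'drift')
print(should_retrain(datetime(2024, 6, 9), now, drift_score=0.05, live_f1=0.80))
```

Note the ordering: the performance check is last and optional, because live F1 only exists once ground truth has arrived.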

Test Yourself

What is shadow mode deployment and when should you use it?

Shadow mode runs the new model alongside the current production model, but the new model's predictions are only logged — never served to users. You compare both models' outputs on real traffic without any user impact. Use it for high-stakes applications (finance, healthcare) where a bad prediction has serious consequences, or when you're deploying a fundamentally different model architecture.

Why are offline metrics (test set accuracy) not sufficient for production?

Offline metrics test the model on a static, curated dataset. In production, data is messy, distributions shift over time (drift), features may be computed differently (training/serving skew), and latency matters. A model with 95% test accuracy might have 80% accuracy on real traffic if the real-world distribution differs from the test set. That's why you need online evaluation (A/B tests, shadow mode) and continuous monitoring.

Explain the champion/challenger pattern.

The champion is the current production model. The challenger is a newly trained model that's been evaluated offline. The challenger must beat the champion on key metrics to be promoted. This is often automated in validation gates: if the challenger's F1 score is higher than the champion's (with statistical significance), it gets promoted to production via canary deployment.

What's the difference between data drift and concept drift?

Data drift (covariate shift) means the input feature distributions change. Example: a new user demographic starts using your product. Concept drift means the relationship between inputs and outputs changes. Example: what constitutes "spam" evolves as spammers adapt. Both degrade model performance, but concept drift is harder to detect because the input distribution may look the same while the correct labels change.

Name three triggers for model retraining.

(1) Scheduled — retrain on a fixed cadence (weekly, monthly). (2) Drift-triggered — retrain when monitoring detects statistically significant data or prediction drift. (3) Performance-triggered — retrain when actual metrics (accuracy, F1) drop below a threshold, requiring ground truth labels.

Interview Questions

Q: How would you deploy a new ML model to production with minimal risk?

Start with shadow mode — run the new model alongside the current one, logging predictions without serving them. Compare outputs. If the new model performs well, move to a canary deployment — route 5% of traffic to the new model, monitor closely. If metrics hold, gradually increase traffic. For high-stakes decisions, run a full A/B test measuring business metrics, not just ML metrics. Do a full rollout only after reaching statistical significance.

Q: Your model accuracy dropped 10% overnight. Walk me through your debugging process.

1. Check data quality first — is the input data pipeline broken? Look for missing values, schema changes, or upstream data issues.
2. Check for data drift — compare recent input distributions against training data. Has a new category appeared? Has a feature's range shifted?
3. Check feature engineering — is the serving code computing features differently than training code (training/serving skew)?
4. Check infrastructure — timeout issues, model loading errors, or memory pressure causing predictions to fail silently.
5. Check for concept drift — has the real-world relationship between features and labels changed? If so, retrain on recent data.

Q: How do you handle ground truth delay when monitoring model performance?

When ground truth labels are delayed (e.g., fraud labels arrive 30-90 days later), use proxy metrics as early warning signals: prediction distribution shifts, data drift scores, feature importance changes, and user behavior metrics (click-through rates, engagement). Set up data drift monitoring to detect when inputs deviate from training distributions — this doesn't require labels. When labels eventually arrive, compute actual metrics and validate that your proxy alerts were calibrated correctly.