MLOps Interview Questions

TL;DR

30+ MLOps interview questions with answers, organized by topic. Perfect for a quick 15-minute revision before an ML engineering interview.

Short on time? Focus on MLOps Fundamentals, Model Deployment, and Scenario Questions — these come up in almost every interview.

MLOps Fundamentals

Q: What is MLOps and how does it differ from DevOps?

MLOps applies DevOps principles (CI/CD, automation, monitoring) to machine learning systems. The key difference is that ML systems have three artifacts to manage (code + data + model), not just code. MLOps adds experiment tracking, data versioning, model registries, drift monitoring, and feature stores — none of which exist in traditional DevOps. Models also decay over time (drift), requiring continuous retraining, whereas software doesn't get worse on its own.

Q: Describe Google's three MLOps maturity levels.

Level 0 (Manual): Data scientists work in notebooks, manually deploy models, no automation or monitoring. Level 1 (ML Pipeline Automation): Automated training pipelines, experiment tracking, model registry. Training is reproducible. Level 2 (CI/CD for ML): Full CI/CD covering code, data, and models. Automated testing, deployment (canary/blue-green), monitoring with drift-triggered retraining. The entire loop runs autonomously.

Q: Why do most ML projects fail to reach production?

The "last mile" gap: a notebook prototype is far from a production service. Common blockers include lack of reproducibility (can't recreate results), training/serving skew (features computed differently), no monitoring for drift, poor data quality in production vs clean datasets, lack of infrastructure for serving and scaling, and organizational silos between data science and engineering teams. MLOps addresses all of these systematically.

Q: What are the key differences between ML systems and traditional software systems?

ML systems differ in: (1) Versioning complexity — must version code, data, AND models. (2) Testing — need to test data quality and model quality, not just code logic. (3) Decay — models degrade over time due to drift; software doesn't. (4) Reproducibility — training involves randomness; same code + data can produce different models. (5) Resource needs — training requires GPUs, large datasets, and hours/days of compute. (6) Debugging — "wrong answer" is harder to debug than a stack trace.

Q: What is technical debt specific to ML systems?

Google's landmark paper "Hidden Technical Debt in ML Systems" identifies ML-specific debt: data dependencies (unstable upstream data), feedback loops (model predictions influence future training data), entanglement (changing one feature affects all others — CACE: Changing Anything Changes Everything), pipeline jungles (complex data preprocessing), dead features (unused features nobody removes), and configuration debt (hyperparameters, thresholds, feature lists managed ad-hoc).

Want deeper coverage? See MLOps Overview and Core Concepts.

Experiment Tracking & Versioning

Q: Why is experiment tracking important and what should you log?

Experiment tracking provides reproducibility and comparability. Without it, you can't recreate results or know which experiment produced the production model. Log: parameters (learning rate, batch size, model architecture), metrics (accuracy, F1, loss curves), artifacts (model file, plots, confusion matrix), data version (which dataset was used), code version (git commit), and environment (Python version, dependencies). Tools: MLflow, Weights & Biases, Neptune.
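As a sketch, the fields above can be captured in a single run record. This is a plain-Python stand-in for what a tracker like MLflow or W&B records, not their actual API; the field names are illustrative:

```python
import time

def log_run(params, metrics, data_version, code_version):
    """Minimal run record illustrating the fields an experiment tracker captures."""
    return {
        "timestamp": time.time(),
        "params": params,              # learning rate, batch size, architecture
        "metrics": metrics,            # accuracy, F1, final loss
        "data_version": data_version,  # e.g. a DVC hash or dataset snapshot ID
        "code_version": code_version,  # git commit hash
        "environment": "python==3.11", # in practice: pinned deps or a Docker image
    }

run = log_run({"lr": 1e-3, "batch_size": 32}, {"f1": 0.91},
              data_version="dvc:3f2a9c1", code_version="git:ab12cd3")
```

With a real tracker the same fields go through its logging calls; the point is that every run carries its parameters, metrics, and data/code lineage.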

Q: How do you version datasets that are too large for Git?

Use DVC (Data Version Control). DVC stores lightweight pointer files (.dvc) in Git while the actual data lives in cloud storage (S3, GCS, Azure Blob). dvc add data/ creates the pointer, git commit versions it, dvc push uploads the data. git checkout v2.0 && dvc checkout restores the exact dataset for any historical version. Alternative: LakeFS provides Git-like branching for data lakes.

Q: What is the difference between a model registry and experiment tracking?

Experiment tracking logs every training run (parameters, metrics, artifacts) — it answers "what did I try?" A model registry catalogs approved model versions and manages their lifecycle through stages (None → Staging → Production → Archived) — it answers "which model is in production?" Experiments produce many candidates; the registry promotes the winners. They're complementary, not interchangeable.

Q: How do you ensure reproducibility in ML experiments?

Version all three inputs: code (git commit hash), data (DVC version or data snapshot ID), and environment (Docker image or requirements.txt hash). Set random seeds (Python, NumPy, framework-specific). Log all hyperparameters. Use deterministic operations where possible (note: some GPU operations are non-deterministic). Store the trained model artifact with its lineage (link to experiment run, data version, code commit).
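A minimal seeding helper covering the Python and NumPy RNGs; framework-specific seeds (e.g. torch, TensorFlow) would be added the same way:

```python
import random
import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Seed the stdlib and NumPy RNGs; add framework seeds here as needed."""
    random.seed(seed)
    np.random.seed(seed)

set_seeds(42)
a = np.random.rand(3)   # first "training run"
set_seeds(42)
b = np.random.rand(3)   # rerun with the same seed
assert np.allclose(a, b)  # identical randomness across runs
```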

Pipelines & Feature Stores

Q: What are the stages of a typical ML training pipeline?

A typical ML pipeline: (1) Data Ingestion — pull from sources, validate schema. (2) Data Validation — check quality, distributions, freshness (Great Expectations, TFDV). (3) Feature Engineering — transform raw data into model features. (4) Training — train model, log experiment. (5) Evaluation — test on held-out set, run validation gates. (6) Registration — if gates pass, register in model registry. Each step is independent, cached, and rerunnable.
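The six stages can be sketched as independent, rerunnable functions; the data and "model" below are trivial placeholders, not a real training setup:

```python
def ingest():
    """Stage 1: pull raw records from a source (hard-coded here)."""
    return [{"amount": 120.0, "label": 0}, {"amount": 15000.0, "label": 1}]

def validate(rows):
    """Stage 2: fail fast on schema/quality problems before training."""
    assert all("amount" in r and r["amount"] >= 0 for r in rows), "schema violation"
    return rows

def engineer(rows):
    """Stage 3: derive model features from raw fields."""
    for r in rows:
        r["amount_high"] = int(r["amount"] > 1000)
    return rows

def train(rows):
    """Stage 4: 'train' a trivial threshold model (placeholder)."""
    return {"threshold": 1000}

def evaluate(model, rows):
    """Stage 5: compute the metric used as a validation gate."""
    correct = sum(int(r["amount"] > model["threshold"]) == r["label"] for r in rows)
    return correct / len(rows)

rows = engineer(validate(ingest()))
model = train(rows)
accuracy = evaluate(model, rows)
# Stage 6 (registration) would run only if accuracy clears the gate.
```

In a real orchestrator (Kubeflow, Airflow) each function becomes a step with cached inputs/outputs, so a failed run resumes from the broken stage.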

Q: What is training/serving skew and how do you prevent it?

Training/serving skew occurs when features are computed differently during training (batch, SQL) versus serving (real-time, Python). The model sees different inputs in production than during training, degrading accuracy. Prevent it with a feature store that defines features once and materializes them to both offline (training) and online (serving) stores. Alternatively, ensure the same code computes features in both paths (shared library, containerized feature service).
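The shared-library approach can be as simple as one function imported by both paths; the transaction fields here are hypothetical:

```python
import math

def compute_features(txn: dict) -> dict:
    """One feature definition, imported by BOTH the training job and the API server."""
    return {
        "amount_log": math.log1p(txn["amount"]),
        "is_foreign": int(txn["country"] != txn["home_country"]),
    }

txn = {"amount": 250.0, "country": "DE", "home_country": "US"}
offline = compute_features(txn)   # called from the batch training pipeline
online = compute_features(txn)    # called from the serving endpoint
assert offline == online          # identical by construction -> no skew
```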

Q: Compare Kubeflow Pipelines, Apache Airflow, and SageMaker Pipelines for ML workflows.

Kubeflow Pipelines: Kubernetes-native, built for ML, has GPU scheduling, integrates with experiment tracking. Requires K8s expertise. Apache Airflow: General-purpose orchestrator, widely adopted for data pipelines, not ML-specific (no built-in tracking), great if you already use it. SageMaker Pipelines: Fully managed on AWS, integrated with SageMaker training/hosting, easy to start but creates vendor lock-in. Choose based on existing infrastructure and team expertise.

Q: What is point-in-time correctness in a feature store?

When building training data, point-in-time correctness ensures features are as-of the event timestamp, not as-of now. If you're training a fraud model on a transaction from January 15, the feature store returns the user's average transaction amount as it was on January 15 — not today's value. Without this, you get data leakage: using future information that wasn't available at prediction time, making offline metrics artificially high.
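With pandas, a point-in-time join can be expressed as a merge_asof: each labeled event picks up the latest feature value at or before its timestamp. The data below is a toy version of the January 15 example:

```python
import pandas as pd

# Feature history: the user's average transaction amount as it changed over time.
features = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "ts": pd.to_datetime(["2024-01-10", "2024-02-01"]),
    "avg_amount": [50.0, 400.0],
})
# Labeled events we want to build training rows for.
events = pd.DataFrame({
    "user_id": ["u1"],
    "ts": pd.to_datetime(["2024-01-15"]),
    "is_fraud": [1],
})
# merge_asof selects the most recent feature row at or before each event's ts,
# so the Jan 15 transaction gets the Jan 10 value (50.0), not today's (400.0).
train = pd.merge_asof(events.sort_values("ts"), features.sort_values("ts"),
                      on="ts", by="user_id")
```

A naive join on user_id alone would leak the February value into a January training row.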

Q: What is the difference between the online and offline store in a feature store?

The offline store serves batch workloads (training data generation). It stores historical feature values, backed by data warehouses (BigQuery, Snowflake, S3+Parquet). Latency: seconds to minutes. The online store serves real-time predictions. It stores the latest feature value per entity, backed by key-value stores (Redis, DynamoDB). Latency: single-digit milliseconds. Both are populated from the same feature definitions for consistency.

Deeper coverage: ML Pipelines & Feature Stores

Model Deployment

Q: What are the main deployment strategies for ML models?

Four strategies: (1) Shadow mode — new model runs alongside production, predictions logged but not served. Safest for validation. (2) Canary deployment — route a small % of traffic (5%) to the new model, monitor, gradually increase. (3) Blue-green — two identical environments, switch traffic all at once, instant rollback. (4) A/B testing — split traffic, measure business metrics (not just ML metrics), run until statistically significant. Typical progression: Shadow → Canary → A/B → Full rollout.
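A canary router is often just a deterministic hash bucket; this sketch assumes a 5% split and hashes the user ID so the same user always hits the same model:

```python
import hashlib

def route(user_id: str, canary_pct: int = 5) -> str:
    """Deterministically bucket a user into canary or production traffic."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_pct else "production"

# Roughly 5% of users land on the canary; raise canary_pct to widen the rollout.
hits = sum(route(f"user-{i}") == "canary" for i in range(10_000))
```

Sticky routing matters: if a user bounced between models, session-level metrics would be uninterpretable.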

Q: What's the difference between real-time and batch model serving?

Real-time (online) serving: Model behind a REST/gRPC API, returns predictions in milliseconds. Use for: fraud detection at transaction time, search ranking, chatbot responses. Requires low latency and autoscaling. Batch serving: Model processes a large dataset on a schedule (hourly, daily). Results written to a table. Use for: nightly customer churn scoring, recommendation precomputation. Simpler infrastructure, no latency requirements, but results are stale between runs.

Q: How do you do A/B testing with ML models, and what metrics do you measure?

Split users randomly into control (current model) and treatment (new model). Run both simultaneously. Measure business metrics (revenue per user, conversion rate, engagement) alongside ML metrics (accuracy, latency). Use statistical significance testing (typically p < 0.05) to determine if the treatment is genuinely better. Important: track both guardrail metrics (metrics that must not degrade, like page load time) and success metrics (metrics you're trying to improve).
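A minimal two-proportion z-test (pooled variance), implemented from scratch here rather than via a stats library, makes the significance check concrete:

```python
import math

def two_proportion_pvalue(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# 5.0% vs 6.0% conversion over 10k users per arm: significant at p < 0.05.
p = two_proportion_pvalue(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
```

In practice you would also fix the sample size up front (power analysis) rather than peeking at p-values mid-experiment.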

Q: How do you containerize an ML model for deployment?

Package the model in a Docker container with: the trained model artifact (pickle, ONNX, or framework-native format), a serving framework (FastAPI, Flask, BentoML, TorchServe), all dependencies (requirements.txt or conda.yml), and a health check endpoint. Use multi-stage builds to keep images small (build stage installs build tools, final stage only has runtime). Serve via Kubernetes with autoscaling. Example flow: mlflow models build-docker -m models:/fraud-detector/Production -n fraud-detector:latest.
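A multi-stage build might look like the following sketch; the file names (serve.py, model/) and base image are assumptions, not a fixed convention:

```dockerfile
# Build stage: install dependencies with build tools available.
FROM python:3.11-slim AS build
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# Final stage: runtime only, which keeps the image small.
FROM python:3.11-slim
COPY --from=build /install /usr/local
COPY model/ /app/model/          # trained artifact (e.g. ONNX or pickle)
COPY serve.py /app/serve.py      # serving app exposing /health and /predict
WORKDIR /app
EXPOSE 8000
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]
```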

Q: What is model serving latency optimization?

Strategies to reduce inference latency: (1) Model optimization — quantization (FP32 to INT8), pruning, distillation (train a smaller model to mimic a larger one). (2) Runtime optimization — use ONNX Runtime or TensorRT instead of native framework inference. (3) Batching — group incoming requests and process together (especially for GPU). (4) Caching — cache predictions for frequently seen inputs. (5) Feature precomputation — use a feature store's online store instead of computing features at request time.
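Caching (strategy 4) can be a one-liner for hashable inputs; the "model" below is a placeholder for an expensive inference call:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def predict(features: tuple) -> float:
    """Stand-in for an expensive model call; repeated inputs skip inference.
    Inputs must be hashable (hence a tuple, not a list or dict)."""
    return sum(features) * 0.1  # placeholder "model"

predict((1.0, 2.0))   # miss: runs "inference"
predict((1.0, 2.0))   # hit: served from the cache
stats = predict.cache_info()
```

This only pays off when inputs repeat (popular items, common queries); for continuous features, cache at a coarser key such as the entity ID.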

Deeper coverage: ML Model Lifecycle

Model Monitoring

Q: What types of drift should you monitor for production ML models?

Three types: (1) Data drift (covariate shift) — input feature distributions change from training data. Detected with statistical tests (KS, chi-squared, PSI). (2) Concept drift — the relationship between inputs and outputs changes (what "spam" looks like evolves). Harder to detect; requires ground truth labels. (3) Prediction drift — model output distribution shifts. Can indicate either data or concept drift. Monitor all three, plus infrastructure metrics (latency, errors, throughput).
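The two-sample KS statistic is just the largest gap between the two empirical CDFs. A small NumPy sketch; alert thresholds depend on sample size, so the comments are indicative only:

```python
import numpy as np

def ks_statistic(reference: np.ndarray, production: np.ndarray) -> float:
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    a, b = np.sort(reference), np.sort(production)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 5_000)                           # training reference
same = ks_statistic(ref, rng.normal(0, 1, 5_000))       # small: no drift
drifted = ks_statistic(ref, rng.normal(0.5, 1, 5_000))  # large: inputs shifted
```

Run this per feature against a frozen training-time reference sample, and alert when the statistic crosses a tuned threshold.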

Q: How do you monitor a model when ground truth labels are delayed?

Use proxy metrics as early warning signals: (1) Data drift detection — compare input distributions against training reference (no labels needed). (2) Prediction distribution monitoring — track if the model's output distribution shifts. (3) Upstream data quality — monitor for missing values, schema changes. (4) Business proxy metrics — click-through rates, user engagement, complaint rates. When labels eventually arrive, compute actual accuracy and calibrate your proxy alert thresholds.

Q: What is PSI (Population Stability Index) and how is it used?

PSI measures how much a variable's distribution has shifted between two datasets (typically training vs production). It bins the variable, computes the proportion in each bin for both distributions, and sums: PSI = Σ (Actual% - Expected%) * ln(Actual%/Expected%). Rules of thumb: PSI < 0.1 = no significant shift, 0.1-0.25 = moderate shift (investigate), > 0.25 = significant drift (likely need retraining). It's commonly used in credit scoring and financial ML for regulatory compliance.
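The formula translates directly to NumPy. Binning choices here (10 bins from the training range, clipping production values into that range, flooring empty bins) are one common convention, not the only one:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI = sum((actual% - expected%) * ln(actual% / expected%)) over shared bins."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Clip production values into the training range so nothing falls outside the bins.
    actual = np.clip(actual, edges[0], edges[-1])
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty bins to avoid log(0) / division by zero.
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(1)
train_col = rng.normal(0, 1, 20_000)
psi_stable = psi(train_col, rng.normal(0, 1, 20_000))   # < 0.1: no shift
psi_drifted = psi(train_col, rng.normal(1, 1, 20_000))  # > 0.25: retrain
```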

Q: What metrics should an ML model monitoring dashboard include?

A comprehensive dashboard should show: ML metrics — accuracy/F1/AUC over rolling windows, prediction distribution, feature drift scores. Data quality — missing value rates, schema violations, data freshness. Infrastructure — latency (p50/p95/p99), throughput (requests/sec), error rates, CPU/memory/GPU utilization. Business impact — downstream KPIs affected by the model (conversion rate, revenue). Include alerts for: drift exceeding thresholds, accuracy below SLA, latency spikes, and data pipeline failures.

Scenario Questions

Q: You inherit an ML model running in production with no documentation. How do you assess its health?

Systematic assessment: (1) Find the model artifact — where is it stored? What framework? What version? (2) Trace the training lineage — what data was it trained on? When? What code? Check for experiment tracking logs, git history, or model metadata. (3) Check for monitoring — are prediction distributions, error rates, or data drift being tracked? (4) Evaluate current performance — sample recent predictions, compare against ground truth if available. Run data drift analysis against the training data. (5) Review the serving setup — latency, error rates, scaling config. (6) Document everything you find and set up monitoring for what's missing.

Q: Your fraud detection model's precision dropped from 92% to 78% over 3 months. Diagnose and fix it.

Diagnosis: (1) Check data drift — has the distribution of transactions changed? New merchant categories, new geographies, new payment methods? (2) Check for concept drift — have fraud patterns evolved? New attack vectors the model hasn't seen? (3) Check feature pipelines — are all features still being computed correctly? Upstream data source changes? (4) Check class distribution — has the fraud rate changed significantly? Fix: (1) Retrain on recent labeled data (last 3-6 months). (2) Add new features that capture the shifted patterns. (3) Set up drift monitoring with automated retraining triggers. (4) Deploy via canary with the old model as fallback. (5) Consider shorter retraining cadence (weekly instead of quarterly).

Q: Design an MLOps architecture for a recommendation system serving 10M users.

Data layer: Feature store (Feast/Tecton) with user features (offline: historical behavior; online: real-time signals). Event streaming (Kafka) for real-time feature updates. Training: Scheduled daily retraining pipeline (Airflow/Kubeflow) on GPU cluster. Experiment tracking (MLflow). A/B test framework for model comparison. Serving: Two-stage architecture — candidate generation (ANN/FAISS for fast retrieval from millions of items) + ranking model (served via TorchServe/Triton on GPU, autoscaled via K8s). Feature store online store for low-latency feature lookups. Monitoring: Evidently for data/prediction drift. Business metrics dashboard (CTR, engagement). Alerts for latency > 100ms p99. Scale: Model sharding, prediction caching for popular items, CDN for precomputed recommendations.

Q: How would you set up CI/CD for an ML project?

ML CI/CD has three test layers: Code CI (on every PR) — lint, unit tests for feature functions, integration tests with sample data. Data CI (on pipeline trigger) — schema validation, distribution checks, freshness verification using Great Expectations. Model CI (after training) — metric thresholds (F1 > 0.85), champion/challenger comparison, latency checks (p99 < 100ms), bias/fairness audits. CD — if all gates pass, register model, deploy via canary (5% → 25% → 100%), with automatic rollback if error rate spikes. Use GitHub Actions or GitLab CI for code CI, and the pipeline orchestrator (Kubeflow/Airflow) for data and model CI.
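The model CI layer reduces to a boolean gate over those thresholds; the numbers below are the example thresholds from the text, and the metric dicts are illustrative:

```python
def model_ci_gate(candidate: dict, champion: dict) -> bool:
    """Return True only if the candidate model clears every promotion gate."""
    checks = [
        candidate["f1"] >= 0.85,                    # absolute metric floor
        candidate["f1"] >= champion["f1"] - 0.001,  # champion/challenger: no regression
        candidate["p99_latency_ms"] <= 100,         # serving SLA
    ]
    return all(checks)

ok = model_ci_gate(
    candidate={"f1": 0.91, "p99_latency_ms": 80},
    champion={"f1": 0.90},
)
```

In the pipeline, a True result triggers registry promotion and the canary rollout; a False result fails the run and keeps the champion in place.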

Q: Your model works great in the notebook but fails in production. Common causes?

Common causes of notebook-to-production failures: (1) Training/serving skew — features computed differently in batch (notebook) vs real-time (serving). (2) Data leakage — notebook accidentally used future data or target-correlated features. (3) Different data distributions — notebook used clean, curated data; production data has missing values, outliers, new categories. (4) Environment differences — different library versions, Python version, or hardware (CPU vs GPU numerics). (5) Scale issues — model is too slow for real-time SLAs, or doesn't fit in serving container memory. (6) Missing preprocessing — notebook had manual cleaning steps not ported to the serving code.

Q: How do you handle model governance in a regulated industry (finance, healthcare)?

Governance requirements: (1) Model cards — document intended use, limitations, performance across subgroups, and ethical considerations. (2) Audit trail — full lineage from data to model to prediction (experiment tracking + model registry). (3) Bias/fairness testing — automated checks for disparate impact across protected attributes (race, gender, age). (4) Approval workflows — human review and sign-off before production deployment (model registry stage gates). (5) Explainability — SHAP values, LIME, or model-specific interpretability for each prediction. (6) Monitoring — drift detection and performance monitoring with regulatory reporting. (7) Rollback capability — ability to quickly revert to a previous model version.