Data Science Interview Questions

TL;DR

35+ data science interview questions with detailed answers, organized by topic. Covers statistics, machine learning, data wrangling, SQL, and real-world case studies.

Short on time? Focus on Statistics & Probability and Machine Learning — they come up most often.

Statistics & Probability

Q: What is the Central Limit Theorem and why is it important?

The CLT states that the sampling distribution of the mean approaches a normal distribution as sample size increases, regardless of the population's distribution. This is important because it allows us to use normal-based methods (t-tests, confidence intervals) even when the underlying data isn't normal. Rule of thumb: n ≥ 30 is usually sufficient.
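A quick simulation makes this concrete (the numbers are illustrative, not from the text): sample means drawn from a heavily skewed exponential population still pile up in a bell shape around the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# 5,000 samples of size n = 30 (the rule-of-thumb sample size) from a
# skewed exponential population with mean 1 and std 1
sample_means = rng.exponential(scale=1.0, size=(5000, 30)).mean(axis=1)

# The CLT predicts the sample means cluster near 1 with
# std ≈ 1/sqrt(30) ≈ 0.18, despite the skewed population
print(round(sample_means.mean(), 2), round(sample_means.std(), 2))
```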

Q: Explain the difference between Type I and Type II errors. Give an example.

Type I (False Positive): Rejecting a true null hypothesis. Example: concluding a drug works when it doesn't. Probability = α (significance level). Type II (False Negative): Failing to reject a false null hypothesis. Example: concluding a drug doesn't work when it does. Probability = β. Power = 1 - β. For a fixed α, increasing the sample size reduces β (raises power); α itself is set by the analyst, so sample size is the main lever for improving error rates without trading one for the other.

Q: What is a p-value? What doesn't it tell you?

A p-value is the probability of observing results as extreme as (or more extreme than) the actual data, assuming the null hypothesis is true. It does NOT tell you: the probability that H0 is true, the size/importance of the effect, or whether the result is practically meaningful. A small p-value means the data is unlikely under H0, not that the alternative is certainly correct.

Q: When would you use a non-parametric test instead of a parametric test?

Use non-parametric tests when: 1) Data violates normality assumptions (heavily skewed). 2) Data is ordinal (rankings, Likert scales). 3) Small sample sizes where normality can't be verified. 4) Presence of outliers. Examples: Mann-Whitney U (instead of t-test), Kruskal-Wallis (instead of ANOVA), Spearman (instead of Pearson). Non-parametric tests are more robust but have less statistical power.
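A minimal sketch with scipy, using hypothetical skewed samples (e.g., session durations in seconds) where one extreme outlier per group would dominate a t-test:

```python
from scipy.stats import mannwhitneyu

# Made-up samples: treatment is shifted up, and each group has one
# extreme outlier that would wreck a means-based comparison
control = [5, 6, 7, 8, 9, 10, 11, 12, 13, 1000]
treatment = [20, 21, 22, 23, 24, 25, 26, 27, 28, 2000]

stat, p = mannwhitneyu(control, treatment, alternative="two-sided")
print(p < 0.05)  # the rank-based comparison still detects the shift
```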

Q: Explain Bayes' Theorem with a practical example.

P(A|B) = P(B|A) × P(A) / P(B). Example: A disease affects 1% of the population. A test has 99% sensitivity (true positive rate) and 95% specificity (true negative rate). If someone tests positive, what's the probability they have the disease? P(disease|positive) = (0.99 × 0.01) / (0.99 × 0.01 + 0.05 × 0.99) = 0.0099 / 0.0594 ≈ 16.7%. Despite the "accurate" test, most positives are false positives because the disease is rare (low base rate).
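The arithmetic of the worked example above, as a few lines of Python:

```python
# The numbers from the example: 1% prevalence, 99% sensitivity,
# 95% specificity
prevalence = 0.01
sensitivity = 0.99    # P(test positive | disease)
specificity = 0.95    # P(test negative | no disease)

# Total probability of a positive test: true positives + false positives
p_positive = (sensitivity * prevalence
              + (1 - specificity) * (1 - prevalence))
posterior = sensitivity * prevalence / p_positive
print(round(posterior, 3))  # ≈ 0.167, i.e. about 16.7%
```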

Q: How do you design a proper A/B test?

1) Define the metric (e.g., conversion rate). 2) State hypotheses (H0: no difference). 3) Calculate required sample size using power analysis (typically 80% power, α=0.05). 4) Randomly assign users to control/treatment (ensure no selection bias). 5) Run for the pre-determined duration (no peeking). 6) Analyze with appropriate test (z-test for proportions, t-test for means). 7) Report both statistical and practical significance. Watch for: novelty effects, Simpson's paradox, and multiple comparisons.
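Step 6 (the z-test for proportions) can be sketched from scratch; the conversion counts below are hypothetical:

```python
from math import sqrt

from scipy.stats import norm

# Hypothetical counts: control converts 100/1000, treatment 130/1000
n1, x1 = 1000, 100
n2, x2 = 1000, 130
p1, p2 = x1 / n1, x2 / n2

# Two-proportion z-test under H0: p1 == p2 (pooled variance)
p_pool = (x1 + x2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_value = 2 * norm.sf(abs(z))  # two-sided
print(round(z, 2), round(p_value, 3))
```

Report the lift (3 percentage points here) alongside the p-value — that is the practical-significance half of step 7.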

Machine Learning

Q: What is the bias-variance tradeoff?

Bias = error from oversimplifying (model can't capture patterns → underfitting). Variance = error from being too sensitive to training data (model memorizes noise → overfitting). Total error = Bias² + Variance + Irreducible noise. Simple models (linear regression) have high bias, low variance. Complex models (deep trees) have low bias, high variance. The goal is the sweet spot. Techniques like regularization, ensemble methods, and cross-validation help balance this.

Q: Explain the difference between L1 and L2 regularization.

L1 (Lasso): Adds sum of absolute values of coefficients to the loss. Drives some coefficients to exactly zero → feature selection. Produces sparse models. L2 (Ridge): Adds sum of squared coefficients to the loss. Shrinks all coefficients toward zero but rarely to exactly zero. Better when all features contribute somewhat. Elastic Net combines both. L1 is better when you suspect many features are irrelevant; L2 when features are correlated.
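A small sketch with scikit-learn on synthetic data (an assumption for illustration: 10 features, only the first two actually drive the target) showing the sparsity difference:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 matter; the other eight are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives the irrelevant coefficients to exactly zero;
# L2 shrinks them but leaves them nonzero
print(int((lasso.coef_ == 0).sum()), int((ridge.coef_ == 0).sum()))
```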

Q: How does Random Forest reduce overfitting compared to a single Decision Tree?

Random Forest uses two techniques: Bagging (each tree trains on a random bootstrap sample of the data) and feature randomness (each split considers only a random subset of features). This makes trees diverse — they make different errors. Averaging many diverse trees reduces variance while keeping bias low. A single tree can overfit to any noise in the data, but hundreds of diverse trees averaging their predictions smooth out the noise.
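A sketch of the variance-reduction effect on synthetic data with deliberately noisy labels (all numbers are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 adds 20% label noise for the single tree to memorize
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200,
                                random_state=0).fit(X_tr, y_tr)

# The single tree overfits the noise; averaging 200 diverse trees doesn't
print(round(tree.score(X_te, y_te), 2), round(forest.score(X_te, y_te), 2))
```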

Q: What is cross-validation and why is it better than a single train/test split?

Cross-validation (e.g., 5-fold) splits data into 5 parts, trains on 4, tests on 1, and rotates. This gives 5 scores instead of 1, providing a mean and standard deviation. Benefits: 1) More reliable performance estimate (not dependent on which rows land in test). 2) Uses all data for both training and testing. 3) Detects instability (high variance across folds = unreliable model). Use stratified k-fold for imbalanced classification.
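A minimal example with scikit-learn's cross_val_score on the built-in iris data, using stratified folds as recommended above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Five scores → report a mean and a spread, not a single point estimate
print(round(scores.mean(), 3), round(scores.std(), 3))
```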

Q: How do you handle class imbalance? (e.g., 95% negative, 5% positive)

Data-level: Oversample minority (SMOTE), undersample majority, or combine both. Algorithm-level: Use class_weight='balanced' in scikit-learn, or adjust decision threshold. Metric-level: Don't use accuracy — use precision, recall, F1, or AUC-ROC. Ensemble: BalancedRandomForest or EasyEnsemble. Also consider: collecting more minority data, treating it as anomaly detection, or using cost-sensitive learning.
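A sketch of the algorithm-level option on synthetic 95/5 data (all numbers illustrative; recall is measured in-sample for brevity):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 95/5 imbalance mirroring the question
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X, y)

# Balanced weights shift the decision boundary toward catching more
# of the rare positive class (at some cost in precision)
recall_plain = (plain.predict(X)[y == 1] == 1).mean()
recall_weighted = (weighted.predict(X)[y == 1] == 1).mean()
print(round(recall_plain, 2), round(recall_weighted, 2))
```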

Q: Explain precision, recall, and F1 score. When is each most important?

Precision = TP / (TP + FP) — of all predicted positives, how many are correct? Recall = TP / (TP + FN) — of all actual positives, how many did we catch? F1 = harmonic mean of precision and recall. Prioritize precision when false positives are costly (spam filter — don't mark real email as spam). Prioritize recall when false negatives are costly (cancer detection — don't miss a case). F1 balances both.
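The formulas with hypothetical confusion counts:

```python
# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)    # 40/50 = 0.8
recall = tp / (tp + fn)       # 40/60 ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(precision, 3), round(recall, 3), round(f1, 3))
```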

Q: What is gradient boosting and how does it differ from bagging?

Bagging (e.g., Random Forest) trains trees independently on random subsets, then averages predictions. Reduces variance. Boosting (e.g., XGBoost, LightGBM) trains trees sequentially — each new tree focuses on the errors of the previous one. Reduces bias. Boosting typically achieves higher accuracy but is more prone to overfitting and sensitive to hyperparameters. Bagging is more robust and parallelizable.

Data Wrangling & Preprocessing

Q: How do you handle missing data? What are the tradeoffs of each approach?

Drop rows: Simple, but loses data. Only if <5% missing and MCAR. Drop column: If >50% missing. Mean/median imputation: Preserves sample size but reduces variance and may weaken correlations. KNN imputation: Uses similar rows to estimate missing values. Better but slower. MICE (Multiple Imputation): Creates multiple imputed datasets and combines results. Gold standard but complex. Flag + impute: Create a "was_missing" indicator. Preserves the signal that data was missing. Always consider why data is missing (MCAR/MAR/MNAR).
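The flag-plus-impute pattern in pandas, on a toy column (values are made up):

```python
import pandas as pd

df = pd.DataFrame({"income": [50_000, None, 62_000, None, 58_000]})

# Flag first, then impute: the indicator preserves the signal that
# the value was missing, which can itself be predictive
df["income_was_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
print(df["income"].tolist(), df["income_was_missing"].tolist())
```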

Q: What is data leakage and how do you prevent it?

Data leakage occurs when information from the test set (or future data) "leaks" into training. Common causes: scaling/imputing before splitting, using features that wouldn't be available at prediction time, target leakage (feature derived from the target). Prevention: Always split first, then preprocess. Use scikit-learn Pipeline to ensure transforms fit only on training data. Audit features for temporal leakage. Signs: suspiciously high accuracy that drops in production.

Q: When would you normalize vs standardize features?

Normalization (Min-Max): Scales to [0,1]. Use when: the algorithm doesn't assume a distribution (KNN, neural networks) or you need bounded values. Sensitive to outliers. Standardization (Z-score): Scales to mean=0, std=1. Use when: algorithm assumes normality (Linear/Logistic Regression), data has outliers, or you're using regularization. Neither: Tree-based models (Random Forest, XGBoost) are scale-invariant and don't need scaling.
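A small comparison with scikit-learn, using toy values chosen to include an outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature data with one outlier
X = np.array([[1.0], [2.0], [3.0], [100.0]])

minmax = MinMaxScaler().fit_transform(X)
zscore = StandardScaler().fit_transform(X)

# The outlier squashes the Min-Max values of 1-3 toward 0;
# z-scores keep relative spread but are unbounded
print(minmax.ravel().round(3))
print(zscore.ravel().round(3))
```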

Q: How do you encode categorical variables?

One-Hot Encoding: Creates binary columns for each category. Use for nominal (no order) categories with <10-15 unique values. Label Encoding: Assigns integers. Only for ordinal data (low/medium/high) or tree-based models. Target Encoding: Replaces category with mean of target. Good for high-cardinality features but risks data leakage (use with cross-validation). Frequency Encoding: Replaces with count/frequency. Safe from leakage, captures rarity signal.
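A sketch of one-hot versus an explicit ordinal map in pandas (toy data):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["low", "high", "medium"]})

# One-hot for the nominal column
onehot = pd.get_dummies(df["color"], prefix="color")

# Explicit integer map for the ordinal column (safer than a label
# encoder, which would order the categories alphabetically)
df["size_encoded"] = df["size"].map({"low": 0, "medium": 1, "high": 2})
print(sorted(onehot.columns), df["size_encoded"].tolist())
```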

Q: What is feature selection and name three methods.

Feature selection removes irrelevant/redundant features to improve model performance and interpretability. Filter methods: Use statistical measures (correlation, mutual information, chi-squared) independent of the model. Fast but doesn't consider feature interactions. Wrapper methods: Train models with different feature subsets (forward selection, backward elimination, recursive feature elimination). Accurate but slow. Embedded methods: Feature selection built into the model (L1 regularization, tree-based feature importance). Good balance of speed and accuracy.

SQL & Data Manipulation

Q: What's the difference between WHERE and HAVING in SQL?

WHERE filters individual rows before grouping. HAVING filters groups after GROUP BY. Example: "Show departments where the average salary > $80K" requires HAVING because you need the group aggregate first. You can't use aggregate functions (AVG, SUM) in WHERE. Execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY.

Q: Explain window functions with an example.

Window functions perform calculations across a set of rows without collapsing them (unlike GROUP BY). Example: SELECT name, salary, AVG(salary) OVER (PARTITION BY department) as dept_avg FROM employees. This adds the department average next to each employee without losing individual rows. Common functions: ROW_NUMBER(), RANK(), LAG(), LEAD(), SUM() OVER(). The PARTITION BY clause defines the "window" (group) and ORDER BY defines the sort within it.

Q: How would you find duplicate rows in SQL?

Use GROUP BY with HAVING COUNT > 1: SELECT email, COUNT(*) as cnt FROM users GROUP BY email HAVING COUNT(*) > 1. To see all duplicate rows: use a CTE with ROW_NUMBER: WITH ranked AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) as rn FROM users) SELECT * FROM ranked WHERE rn > 1. The second approach also lets you delete duplicates by deleting rows where rn > 1.

Q: What are the different types of JOINs? When would you use each?

INNER JOIN: Only matching rows from both tables. Use when you need rows that exist in both. LEFT JOIN: All rows from left table + matching from right (NULL if no match). Use to preserve all records from a primary table. RIGHT JOIN: Opposite of LEFT (rarely used, just swap table order). FULL OUTER JOIN: All rows from both, NULLs where no match. Use for finding mismatches. CROSS JOIN: Every combination (Cartesian product). Use for generating all possible pairs.

Case Studies & Problem Solving

Q: How would you build a recommendation system for an e-commerce site?

Collaborative filtering: Users who bought X also bought Y (user-based or item-based). Content-based: Recommend products similar to what the user has liked (based on product features). Hybrid: Combine both. Steps: 1) Collect implicit (clicks, views) and explicit (ratings) feedback. 2) Handle cold-start (new users/items) with popularity-based or content-based recommendations. 3) Build a user-item matrix, apply matrix factorization (SVD, ALS). 4) Evaluate with precision@K, recall@K, NDCG. 5) A/B test before deploying.

Q: Your model performs well in testing but poorly in production. What could be wrong?

Common causes: 1) Data drift — production data distribution has shifted from training data. 2) Data leakage — test set wasn't truly independent (information leaked during preprocessing). 3) Feature engineering bugs — features computed differently in production pipeline. 4) Training-serving skew — different code paths for training vs serving. 5) Concept drift — the relationship between features and target has changed over time. 6) Sampling bias — training data not representative of production traffic. Monitor predictions and retrain regularly.
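Cause 1 (data drift) can be monitored with a two-sample test per feature; here is a sketch using scipy's KS test on synthetic "training" and "production" samples:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic example: a feature whose production distribution has
# shifted by half a standard deviation (numbers illustrative)
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)
prod_feature = rng.normal(loc=0.5, scale=1.0, size=2000)

stat, p = ks_2samp(train_feature, prod_feature)
print(p < 0.01)  # True here: distributions differ — investigate drift
```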

Q: How would you approach a churn prediction problem?

1) Define churn: What counts as churn? (No login in 30 days? Cancelled subscription?) 2) Collect features: Usage patterns, demographics, support tickets, payment history, tenure. 3) Feature engineering: Trend features (declining usage), recency, frequency, monetary value (RFM). 4) Handle imbalance: Churn is typically 5-10%. Use SMOTE, class weights, or F1/AUC metrics. 5) Model: Start with Logistic Regression (interpretable), then try XGBoost. 6) Interpret: Which features drive churn? (SHAP values). 7) Action: Identify at-risk users for retention campaigns.

Q: You're given a dataset with 500 features and 10,000 rows. How do you approach this?

This is a relatively high-dimensional problem (500 features is a lot for 10,000 rows, so overfitting is a real risk). 1) Dimensionality reduction: PCA to find the most informative components. 2) Feature selection: Remove highly correlated features (correlation matrix), use L1 regularization (Lasso) to zero out irrelevant features, or use tree-based feature importance. 3) Regularization: Absolutely necessary — use Ridge, Lasso, or Elastic Net. 4) Cross-validation: Critical with limited data. 5) Consider: domain knowledge to manually select relevant features. Avoid overfitting — simpler models often win here.

Q: How do you communicate model results to non-technical stakeholders?

1) Lead with business impact: "This model identifies 80% of customers who will churn, saving an estimated $2M/year in retention costs." 2) Use visualizations: confusion matrices as 2x2 tables with real labels (not TP/FP), feature importance as bar charts, ROC curves as "tradeoff between catching more positives vs more false alarms." 3) Give concrete examples: "For customer X, the top factors predicting churn are: no login in 14 days, 3 support tickets, and downgraded plan." 4) Quantify uncertainty: "We're 90% confident the improvement is between 5-12%." 5) Avoid jargon: Say "accuracy" not "F1-macro weighted."

Coding & Practical

Q: Write a function to calculate the moving average of a list.

Using pandas: df['col'].rolling(window=7).mean(). From scratch:

def moving_average(data, window):
    result = []
    for i in range(len(data)):
        if i < window - 1:
            # Not enough history for a full window yet
            result.append(None)
        else:
            # Average of the current value and the window-1 before it
            result.append(sum(data[i-window+1:i+1]) / window)
    return result

Or with NumPy: np.convolve(data, np.ones(window)/window, mode='valid'). Note that mode='valid' returns only the len(data) - window + 1 fully covered positions, so the output is shorter than the input rather than padded with None.

Q: How would you detect multicollinearity in your features?

1) Correlation matrix: Check for pairs with |r| > 0.8. 2) VIF (Variance Inflation Factor): VIF > 5-10 indicates multicollinearity. Calculate with variance_inflation_factor from statsmodels.stats.outliers_influence. 3) Condition number: >30 suggests multicollinearity. Fixes: Remove one of the correlated features, combine them (PCA), or use regularization (Ridge handles it well). Multicollinearity makes individual coefficients unreliable but doesn't hurt prediction accuracy.
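A from-scratch sketch of the VIF calculation on synthetic collinear data (statsmodels' variance_inflation_factor packages the same idea):

```python
import numpy as np

# Synthetic data where x2 is nearly a copy of x1 (illustrative)
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2), regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - (y - A @ beta).var() / y.var()
    return 1 / (1 - r2)

# x1's VIF is huge (x2 predicts it almost perfectly); x3's is near 1
print(round(vif(X, 0), 1), round(vif(X, 2), 2))
```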

Q: Explain the scikit-learn Pipeline and why it matters.

A Pipeline chains preprocessing steps and a model into a single object. Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())]). Benefits: 1) Prevents data leakage — scaler fits only on training data during cross-validation. 2) Cleaner code — one fit() and predict() call. 3) Easy to serialize for production (pickle the whole pipeline). 4) Works with GridSearchCV for hyperparameter tuning across all steps. Always use Pipelines in production code.
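A runnable sketch on synthetic data, showing a Pipeline evaluated under cross-validation so the scaler is re-fit on each training fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])

# Inside cross_val_score the scaler is fit on each training fold only,
# so no test-fold statistics leak into preprocessing
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```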

Q: What is SHAP and how does it explain model predictions?

SHAP (SHapley Additive exPlanations) assigns each feature a contribution value for every prediction based on game theory. For a prediction of $350K for a house, SHAP might show: location (+$80K), size (+$50K), age (-$30K) vs the baseline average of $250K. Benefits: consistent, theoretically grounded, works with any model, provides both local (per-prediction) and global (overall) explanations. Usage: import shap; explainer = shap.TreeExplainer(model); shap_values = explainer.shap_values(X)

Behavioral & Communication

Q: Tell me about a time you worked with messy data. What did you do?

Structure your answer: Situation (what was the project, what data issues existed), Task (what needed to be clean for the analysis), Action (specific steps: profiling with df.info(), handling nulls by strategy, fixing types, documenting assumptions), Result (outcome: "cleaned 50K records, reduced null rate from 30% to 2%, analysis revealed X insight"). Key: show you have a systematic process, not ad-hoc fixes. Mention documenting your decisions for reproducibility.

Q: How do you decide which model to use for a given problem?

1) Understand the problem: Classification vs regression? Interpretability needed? Real-time or batch? 2) Start simple: Linear/Logistic Regression as baseline. 3) Consider data size: Small data → simpler models (regularized regression). Large data → can support complex models (ensembles, neural nets). 4) Try 2-3 candidates: Random Forest, XGBoost, and the baseline. Compare with cross-validation. 5) Consider constraints: Inference speed, model size, interpretability requirements, team expertise. Often the simplest model that meets the performance threshold is the best choice.