Machine Learning Fundamentals

TL;DR

Machine learning teaches computers to learn from data. Supervised learning uses labeled data (regression for numbers, classification for categories). Unsupervised learning finds patterns without labels (clustering, dimensionality reduction). Key algorithms: Linear Regression, Logistic Regression, Decision Trees, Random Forest, XGBoost, K-Means. Always split data, tune hyperparameters, and watch for overfitting.

Explain Like I'm 12

Imagine you're learning to recognize dogs vs cats. Supervised learning is like having a teacher show you labeled photos: "This is a dog, this is a cat." After enough examples, you can label new photos yourself. Unsupervised learning is like sorting a pile of photos into groups without being told what's what — you notice some have pointy ears and some have round ears. Overfitting is like memorizing the exact photos instead of learning the general idea — you'd fail on new photos you haven't seen.

The ML Landscape

Machine learning algorithms fall into two main categories, depending on whether the training data comes with labels.

[Figure: Machine learning taxonomy — supervised (regression, classification), unsupervised (clustering, dimensionality reduction), and the evaluation workflow]

Supervised Learning

You give the algorithm inputs (features) and correct answers (labels). It learns the mapping from inputs to outputs, then predicts labels for new data.

Regression (Predicting Numbers)

When your target is a continuous number — house prices, temperature, revenue.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Linear Regression: simplest baseline
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")

# Coefficients as a rough guide to feature influence
# (assumes X is a pandas DataFrame; compare magnitudes only if features are scaled)
for feat, coef in zip(X.columns, model.coef_):
    print(f"  {feat}: {coef:.3f}")
Info: R² tells you what fraction of the variance in the target your model explains — R² = 0.85 means 85% of the variation is captured. RMSE reports the typical prediction error in the same units as your target; because errors are squared before averaging, large misses are penalized more heavily.

Classification (Predicting Categories)

When your target is a category — spam/not spam, churn/retain, disease/healthy.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix: actual vs predicted
print(confusion_matrix(y_test, y_pred))
# [[TN, FP],
#  [FN, TP]]

# Full report
print(classification_report(y_test, y_pred))
| Algorithm | Type | Best For | Pros | Cons |
|---|---|---|---|---|
| Linear Regression | Regression | Linear relationships | Fast, interpretable | Assumes linearity |
| Logistic Regression | Classification | Binary outcomes | Fast, probabilistic | Linear decision boundary |
| Decision Tree | Both | Non-linear, interpretable | Easy to visualize | Overfits easily |
| Random Forest | Both | General purpose | Robust, feature importance | Slow on large data |
| XGBoost | Both | Tabular data competitions | Often best accuracy | More hyperparameters |
| KNN | Both | Small datasets | Simple, no training | Slow at prediction time |
Tip: Start with Logistic Regression (classification) or Linear Regression (regression) as your baseline. Then try Random Forest or XGBoost. Often the simple model is good enough, and it gives you a benchmark to beat.
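As a sketch of this baseline-first workflow — the dataset here is synthetic (make_classification) and stands in for your real X and y:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Baseline: fast and interpretable
baseline_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Challenger: more flexible, more expensive
forest_acc = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5).mean()

print(f"Logistic Regression CV accuracy: {baseline_acc:.3f}")
print(f"Random Forest CV accuracy:       {forest_acc:.3f}")
```

Whatever the numbers come out to, the baseline gives you the benchmark the fancier model has to beat.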

Unsupervised Learning

No labels — the algorithm discovers structure in the data on its own. Used for customer segmentation, anomaly detection, and dimensionality reduction.

Clustering (Grouping Similar Items)

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Always scale features before clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# K-Means: group customers into 4 segments
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(X_scaled)  # df: the DataFrame X was built from

# Analyze clusters
print(df.groupby('cluster')[['age', 'income', 'spending']].mean())

# How to pick K? Elbow method
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
# Plot k vs. inertia; the "elbow" where the curve flattens is a good K
Warning: Always scale your features before clustering. If "income" ranges from 0-200,000 and "age" from 0-80, the algorithm will cluster almost entirely on income because it has larger values. StandardScaler puts all features on the same scale.
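Besides the elbow method, the silhouette score gives a single number per K (higher is better, range -1 to 1). A sketch on synthetic blobs standing in for customer data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with 4 true clusters stands in for real features
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
    print(f"k={k}: silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

Unlike inertia, the silhouette score doesn't decrease automatically as K grows, so you can simply pick the K that maximizes it.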

Dimensionality Reduction

Reduce the number of features while preserving the most important information. PCA (Principal Component Analysis) is the most common method.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Reduce 50 features down to 2 for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# How much variance does each component explain?
print(f"Explained variance: {pca.explained_variance_ratio_}")
# e.g., [0.45, 0.20] = first 2 components capture 65% of variance

# Plot
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=df['cluster'], cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Customer Segments (PCA)')
plt.show()
Info: PCA is great for visualization (reduce to 2-3 dims to plot) and speeding up models (fewer features = faster training). It's also useful for removing correlated features.
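One detail worth knowing: n_components also accepts a float, in which case PCA keeps the smallest number of components needed to explain that fraction of variance. A sketch on synthetic correlated data (the 0.95 threshold and data shapes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 independent features plus 10 near-duplicates,
# standing in for a real dataset with correlated columns
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 10))
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 10))])

X_scaled = StandardScaler().fit_transform(X)

# Float n_components: keep enough components for 95% of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"Kept {pca.n_components_} of {X.shape[1]} components")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.3f}")
```

Because half the columns are near-copies, far fewer than 20 components should be needed to hit the threshold.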

The Bias-Variance Tradeoff

The fundamental challenge in ML: your model must be complex enough to capture real patterns (low bias) but simple enough to generalize to new data (low variance).

|  | Underfitting (High Bias) | Good Fit | Overfitting (High Variance) |
|---|---|---|---|
| Train Score | Low | High | Very high (~100%) |
| Test Score | Low | High (close to train) | Low (big gap from train) |
| Model | Too simple | Just right | Too complex |
| Fix | More features, more complex model | Ship it! | More data, regularization, simpler model |
# Detect overfitting: compare train vs test scores
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print(f"Train: {train_score:.3f}")
print(f"Test:  {test_score:.3f}")
# If train ≈ 0.99 and test ≈ 0.75 → overfitting!

# Fix: reduce complexity
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,        # limit tree depth
    min_samples_leaf=5,  # require more samples per leaf
    random_state=42
)
Warning: A model with 99% training accuracy but 75% test accuracy is overfitting. It memorized the training data but can't generalize. The gap between train and test scores is the key signal.
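To see how limiting complexity closes that gap, here is a sketch that sweeps max_depth on a single decision tree. The data is synthetic (make_classification with a little label noise) and stands in for a real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 5% flipped labels so there is noise to memorize
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

gaps = {}
for depth in [None, 10, 5, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    gaps[depth] = tree.score(X_train, y_train) - tree.score(X_test, y_test)
    print(f"max_depth={depth}: train-test gap = {gaps[depth]:.3f}")
```

The unconstrained tree memorizes the noise (near-perfect train score, big gap); shallower trees trade a little train accuracy for a much smaller gap.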

Hyperparameter Tuning

Hyperparameters are settings you choose before training (like max_depth, n_estimators). Use grid search or random search to find the best combination.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [5, 10, 20, None],
    'min_samples_leaf': [1, 2, 5]
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1  # use all CPU cores
)
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV F1:  {grid.best_score_:.3f}")

# Use the best model — note that .score() on a classifier reports
# accuracy, so use f1_score to match the CV metric
from sklearn.metrics import f1_score

best_model = grid.best_estimator_
print(f"Test F1: {f1_score(y_test, best_model.predict(X_test)):.3f}")
Tip: Use RandomizedSearchCV instead of GridSearchCV when you have many hyperparameters. It samples a fixed number of random combinations, so it's much faster and usually finds results nearly as good.
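A sketch of the randomized alternative, mirroring the grid search above but sampling only 15 combinations. The synthetic data and the ranges/distributions are illustrative:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train, y_train
X, y = make_classification(n_samples=300, random_state=42)

param_dist = {
    'n_estimators': randint(50, 201),     # sample integers in [50, 200]
    'max_depth': [5, 10, 20, None],       # lists are sampled uniformly
    'min_samples_leaf': randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=15,        # try only 15 random combinations
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)

print(f"Best params:   {search.best_params_}")
print(f"Best CV score: {search.best_score_:.3f}")
```

Note that continuous distributions (scipy.stats) let random search explore values a fixed grid would never try.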

Test Yourself

You're predicting whether a customer will churn (yes/no). Which type of ML problem is this, and what's a good baseline algorithm?

This is a binary classification problem (supervised learning). A good baseline is Logistic Regression — it's fast, interpretable, outputs probabilities, and handles this well. Evaluate with precision, recall, and F1 (not just accuracy, since churn is often imbalanced). Then try Random Forest or XGBoost to see if you can beat the baseline.

Your model has 98% training accuracy but 72% test accuracy. What's happening and how do you fix it?

The model is overfitting — it memorized training data but can't generalize. Fixes: 1) Get more training data. 2) Simplify the model (reduce max_depth, increase min_samples_leaf). 3) Add regularization (L1/L2). 4) Remove noisy or redundant features. 5) Use cross-validation instead of a single split.

Why do you need to scale features before K-Means clustering?

K-Means uses Euclidean distance to measure similarity. If features have different scales (income: 0-200K, age: 0-80), distance is dominated by the larger-scale feature. Scaling (e.g., StandardScaler) puts all features on the same scale so each contributes equally to the distance calculation.
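A tiny numeric illustration with three made-up customers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age, income]
X = np.array([[25, 40_000],    # A: young, low income
              [26, 90_000],    # B: young, high income
              [60, 41_000]],   # C: older, low income
             dtype=float)

# Unscaled: income dominates, so A looks close to C (35 years older!)
# and far from B (only 1 year older)
d_ab = np.linalg.norm(X[0] - X[1])
d_ac = np.linalg.norm(X[0] - X[2])
print(f"Unscaled: d(A,B)={d_ab:,.0f}  d(A,C)={d_ac:,.0f}")

# Scaled: both features contribute comparably to the distance
Xs = StandardScaler().fit_transform(X)
ds_ab = np.linalg.norm(Xs[0] - Xs[1])
ds_ac = np.linalg.norm(Xs[0] - Xs[2])
print(f"Scaled:   d(A,B)={ds_ab:.2f}  d(A,C)={ds_ac:.2f}")
```

Before scaling, the income axis makes the age difference essentially invisible; after scaling, the two distances are of the same order.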

When would you use PCA?

Use PCA when: 1) You have many correlated features and want to reduce them. 2) You want to visualize high-dimensional data in 2D/3D. 3) Training is too slow due to too many features. 4) You want to remove noise (minor components often capture noise). Keep enough components to explain 90-95% of variance.

What's the difference between GridSearchCV and RandomizedSearchCV?

GridSearchCV tries every combination of hyperparameters (exhaustive but slow). RandomizedSearchCV samples a fixed number of random combinations (faster, especially with many parameters). In practice, RandomizedSearchCV with 50-100 iterations often finds results within 1-2% of the grid search optimum in a fraction of the time.

Interview Questions

Explain the bias-variance tradeoff. How does it relate to model complexity?

Bias is error from wrong assumptions (underfitting). Variance is error from sensitivity to training data (overfitting). Simple models have high bias + low variance. Complex models have low bias + high variance. The goal is the sweet spot: enough complexity to capture patterns, not so much that it memorizes noise. Regularization, cross-validation, and ensemble methods all help balance this tradeoff.

How does Random Forest improve on a single Decision Tree?

Random Forest is an ensemble of many decision trees, each trained on a random subset of data (bagging) and random subset of features. This reduces variance (overfitting) because individual tree errors cancel out when averaged. A single tree can overfit badly; 100 trees voting together generalize much better. The randomness ensures trees are diverse — if they all made the same mistakes, averaging wouldn't help.
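A quick sketch of that variance-reduction claim on noisy synthetic data (make_classification with 10% flipped labels stands in for a real dataset) — the forest's averaged vote should generalize better than one tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: single trees tend to overfit the flipped labels
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)

tree_acc = cross_val_score(
    DecisionTreeClassifier(random_state=42), X, y, cv=5).mean()
forest_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5).mean()

print(f"Single tree CV accuracy:   {tree_acc:.3f}")
print(f"Random forest CV accuracy: {forest_acc:.3f}")
```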

When would you choose Logistic Regression over Random Forest?

Choose Logistic Regression when: 1) You need interpretability (coefficients explain feature importance). 2) You have a small dataset (fewer parameters = less overfitting risk). 3) The relationship is roughly linear. 4) You need probability outputs (calibrated by default). 5) You need speed in production (much faster inference). Random Forest is better when relationships are non-linear and you have enough data.

What is regularization and why is it used?

Regularization adds a penalty for model complexity to the loss function. L1 (Lasso): penalizes absolute value of coefficients, drives some to exactly zero (feature selection). L2 (Ridge): penalizes squared coefficients, shrinks them toward zero but rarely to exactly zero. Elastic Net: combines both. Purpose: prevent overfitting by discouraging the model from relying too heavily on any single feature or fitting noise.
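A small sketch of the L1-vs-L2 difference on the same synthetic data (the alpha values are illustrative): Lasso drives irrelevant coefficients to exactly zero, Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

n_lasso_zeros = int(np.sum(lasso.coef_ == 0))
n_ridge_zeros = int(np.sum(ridge.coef_ == 0))
print(f"Lasso coefficients driven to exactly zero: {n_lasso_zeros} of 10")
print(f"Ridge coefficients driven to exactly zero: {n_ridge_zeros} of 10")
```

This is why L1 doubles as feature selection: the surviving nonzero coefficients point at the informative features.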