Statistical Analysis & Probability

TL;DR

Statistics is the math backbone of data science. Know your distributions (normal, binomial, Poisson), run hypothesis tests (t-test, chi-squared, ANOVA) to judge whether results could be random noise, calculate confidence intervals to quantify uncertainty, and design A/B tests to measure real impact. Python's scipy and statsmodels make it all practical.

Explain Like I'm 12

Imagine you flip a coin 100 times and get 60 heads. Is the coin unfair, or did you just get lucky? Statistics answers that question. It tells you: "There's only a 3% chance of getting 60+ heads with a fair coin, so the coin is probably loaded." That 3% is called a p-value. Statisticians set a threshold (usually 5%) — if the p-value is below it, we say the result is statistically significant (not just luck).
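The coin example can be checked directly in Python. This is a quick sketch using scipy.stats.binomtest (available in SciPy 1.7+), with the same 60-of-100 numbers from the paragraph above:

```python
from scipy import stats

# Exact binomial test: how surprising are 60+ heads from a fair coin?
result = stats.binomtest(60, n=100, p=0.5, alternative='greater')
p_value = result.pvalue
print(f"P(60 or more heads | fair coin) = {p_value:.4f}")  # ≈ 0.028
```

With the usual 5% threshold, 0.028 is below 0.05, so you'd call the coin significantly biased, matching the intuition above.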

The Statistical Toolkit

Every data science decision involves uncertainty. Statistics gives you tools to quantify that uncertainty and make defensible decisions.

Statistical analysis toolkit: descriptive stats, probability distributions, hypothesis testing, confidence intervals, A/B testing

Descriptive Statistics

Before testing hypotheses, summarize your data. These are the numbers that describe the shape and spread of your dataset.

import numpy as np
import pandas as pd

# Central tendency
mean = df['salary'].mean()       # average (sensitive to outliers)
median = df['salary'].median()   # middle value (robust to outliers)
mode = df['salary'].mode()[0]    # most common value

# Spread
std = df['salary'].std()         # standard deviation
var = df['salary'].var()         # variance (std²)
iqr = df['salary'].quantile(0.75) - df['salary'].quantile(0.25)  # IQR

# Shape
skew = df['salary'].skew()      # >0 = right-skewed, <0 = left-skewed
kurt = df['salary'].kurtosis()   # >0 = heavy tails, <0 = light tails

# Quick summary
print(df['salary'].describe())

| Measure | What It Tells You | When to Use |
| --- | --- | --- |
| Mean | Average value | Symmetric distributions without outliers |
| Median | Middle value (50th percentile) | Skewed data or data with outliers |
| Standard Deviation | How spread out values are | Understanding variability |
| IQR | Spread of the middle 50% | Robust measure of spread |
| Skewness | Asymmetry direction | Deciding if transforms are needed |

Tip: If mean >> median, your data is right-skewed (a few very high values pull the mean up). Use the median for a more representative "typical" value. Common example: income data.
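A tiny illustration with made-up salary numbers shows how a single outlier separates mean and median:

```python
import pandas as pd

# Hypothetical salaries: one executive outlier drags the mean far above the median
salaries = pd.Series([45_000, 50_000, 52_000, 55_000, 60_000, 1_000_000])
print(f"mean:   {salaries.mean():,.0f}")    # ≈ 210,333 — distorted by the outlier
print(f"median: {salaries.median():,.0f}")  # 53,500 — the 'typical' salary
print(f"skew:   {salaries.skew():.2f}")     # positive, i.e. right-skewed
```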

Probability Distributions

A distribution describes the possible values a variable can take and how likely each is. Knowing the right distribution lets you model uncertainty correctly.

Normal (Gaussian) Distribution

The bell curve. Defined by its mean (μ) and standard deviation (σ). Many natural phenomena approximately follow it (heights, test scores, measurement errors).

from scipy import stats
import numpy as np

# Generate normally distributed data
data = np.random.normal(loc=100, scale=15, size=1000)  # μ=100, σ=15

# 68-95-99.7 rule
within_1_std = np.mean(np.abs(data - 100) < 15)   # ≈ 68%
within_2_std = np.mean(np.abs(data - 100) < 30)   # ≈ 95%
within_3_std = np.mean(np.abs(data - 100) < 45)   # ≈ 99.7%

# Probability of a value > 130
p = 1 - stats.norm.cdf(130, loc=100, scale=15)  # ≈ 0.023 (2.3%)
print(f"P(X > 130) = {p:.4f}")
Info: The Central Limit Theorem says that averages of samples from any distribution with finite variance become approximately normal as sample size grows. This is why the normal distribution is so important — it lets you use hypothesis tests even when the underlying data isn't normal.
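A short simulation makes the CLT concrete. The exponential distribution and the seed here are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)

# Exponential distribution: heavily right-skewed, nothing like a bell curve
population = rng.exponential(scale=2.0, size=100_000)

# Means of many samples of size 50 drawn from that skewed population
sample_means = np.array(
    [rng.choice(population, size=50).mean() for _ in range(2_000)]
)

# The sample means cluster around the true mean and look roughly normal
print(f"mean of sample means: {sample_means.mean():.3f}")  # ≈ 2.0 (the true mean)
print(f"std of sample means:  {sample_means.std():.3f}")   # ≈ 2/sqrt(50) ≈ 0.28
```

Plot a histogram of `sample_means` and you'll see a bell curve, even though the raw data is strongly skewed.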

Other Key Distributions

| Distribution | Use Case | Example |
| --- | --- | --- |
| Binomial | Count of successes in n trials | How many emails are spam out of 100? |
| Poisson | Count of events in a time window | How many support tickets per hour? |
| Exponential | Time between events | Time between customer arrivals |
| Uniform | Equal probability across a range | Random number generator output |
| Chi-squared | Sum of squared normal variables | Categorical independence testing |
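For instance, if support tickets really do follow a Poisson distribution, scipy can answer staffing questions directly (the 4-per-hour rate is hypothetical):

```python
from scipy import stats

# Support tickets arrive at an average of 4 per hour (Poisson assumption)
lam = 4

# Probability of exactly 6 tickets in the next hour
p_exactly_6 = stats.poisson.pmf(6, mu=lam)

# Probability of more than 8 tickets (a staffing concern)
p_more_than_8 = 1 - stats.poisson.cdf(8, mu=lam)

print(f"P(exactly 6)    = {p_exactly_6:.3f}")   # ≈ 0.104
print(f"P(more than 8)  = {p_more_than_8:.3f}")  # ≈ 0.021
```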

Hypothesis Testing

Hypothesis testing answers: "Is this result real or just random noise?" You set up two competing hypotheses, collect data, and use a test statistic to decide.

The 5-Step Process

  1. State hypotheses — H0 (null: no effect) vs Ha (alternative: there is an effect)
  2. Choose significance level — Usually α = 0.05 (5% chance of false positive)
  3. Select the right test — Based on data type and question
  4. Calculate p-value — Probability of seeing this result if H0 is true
  5. Decide — If p < α, reject H0. If p ≥ α, fail to reject H0
from scipy import stats

# t-test: Is the average salary different from $75,000?
sample_salaries = df['salary'].dropna()
t_stat, p_value = stats.ttest_1samp(sample_salaries, 75000)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: Average salary IS significantly different from $75K")
else:
    print("Fail to reject H0: No significant difference from $75K")

Choosing the Right Test

| Question | Test | scipy Function |
| --- | --- | --- |
| Is the mean different from a value? | One-sample t-test | stats.ttest_1samp() |
| Are two group means different? | Two-sample t-test | stats.ttest_ind() |
| Are paired measurements different? | Paired t-test | stats.ttest_rel() |
| Are 3+ group means different? | ANOVA | stats.f_oneway() |
| Are two categorical variables related? | Chi-squared test | stats.chi2_contingency() |
| Is the data normally distributed? | Shapiro-Wilk test | stats.shapiro() |
| Non-normal, two groups? | Mann-Whitney U test | stats.mannwhitneyu() |

Warning: A low p-value does NOT mean the effect is large or important. Statistical significance ≠ practical significance. With a huge sample, even a tiny difference can be "significant." Always report effect size alongside p-values.
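Cohen's d is a common effect-size companion to the t-test. scipy has no built-in for it, so this sketch hand-rolls the formula on simulated groups whose true difference is deliberately tiny:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two large simulated groups with a very small true difference (100 vs 101)
group_a = rng.normal(loc=100, scale=15, size=5_000)
group_b = rng.normal(loc=101, scale=15, size=5_000)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Cohen's d: difference in means scaled by the pooled standard deviation
pooled_std = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_std

print(f"p-value:   {p_value:.4f}")   # often 'significant' purely because n is huge
print(f"Cohen's d: {cohens_d:.3f}")  # well under 0.2, negligible by Cohen's benchmarks
```

Common rough benchmarks: d ≈ 0.2 small, 0.5 medium, 0.8 large. A significant p-value paired with d near zero is a red flag for practical relevance.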

Confidence Intervals

A confidence interval gives you a range of plausible values for a parameter, not just a single estimate.

from scipy import stats
import numpy as np

sample = df['conversion_rate'].dropna()
mean = sample.mean()
se = stats.sem(sample)  # standard error of the mean

# 95% confidence interval
ci_95 = stats.t.interval(
    confidence=0.95,
    df=len(sample) - 1,
    loc=mean,
    scale=se
)
print(f"Mean: {mean:.4f}")
print(f"95% CI: ({ci_95[0]:.4f}, {ci_95[1]:.4f})")
# "We're 95% confident the true conversion rate is between X and Y"
Info: A 95% CI means: if we repeated this experiment 100 times, ~95 of those intervals would contain the true value. It does NOT mean there's a 95% probability the true value is in this specific interval (a common misconception).
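That long-run interpretation can be verified by simulation: draw many samples from a known population and count how often the interval captures the true mean (the seed and parameters here are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean = 50.0

# Repeat the experiment many times and count CI "hits" on the true mean
hits = 0
n_trials = 1_000
for _ in range(n_trials):
    sample = rng.normal(loc=true_mean, scale=10, size=40)
    ci = stats.t.interval(confidence=0.95, df=len(sample) - 1,
                          loc=sample.mean(), scale=stats.sem(sample))
    if ci[0] <= true_mean <= ci[1]:
        hits += 1

coverage = hits / n_trials
print(f"Coverage: {coverage:.1%}")  # close to 95%, by construction
```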

Correlation

Correlation measures the strength and direction of a linear relationship between two variables. It ranges from -1 (perfect negative) to +1 (perfect positive).

import pandas as pd

# Pearson correlation (linear relationships, assumes normality)
pearson_corr = df['study_hours'].corr(df['exam_score'])

# Spearman correlation (monotonic relationships, robust to outliers)
spearman_corr = df['study_hours'].corr(df['exam_score'], method='spearman')

# Full correlation matrix
corr_matrix = df.select_dtypes(include='number').corr()

# Statistical significance of correlation
from scipy import stats
r, p_value = stats.pearsonr(df['study_hours'], df['exam_score'])
print(f"r = {r:.3f}, p = {p_value:.4f}")
Warning: Correlation does not imply causation. Ice cream sales and drowning rates are correlated (both increase in summer) but one doesn't cause the other. The confounding variable is temperature.
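Outliers are another reason the two correlation measures diverge. In this contrived example, a single extreme point flips Pearson's sign, while Spearman, which works on ranks, still reports the underlying positive trend:

```python
import numpy as np
from scipy import stats

# A clean positive trend (y = 2x) with one wild outlier tacked on the end
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, -50], dtype=float)

pearson_r, _ = stats.pearsonr(x, y)    # dominated by the single outlier
spearman_r, _ = stats.spearmanr(x, y)  # ranks cap the outlier's influence

print(f"Pearson:  {pearson_r:.3f}")   # negative, despite 9 of 10 points trending up
print(f"Spearman: {spearman_r:.3f}")  # ≈ 0.45, still positive
```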

A/B Testing

A/B testing is hypothesis testing applied to real-world experiments. Show group A the current version, group B the new version, and measure if B is better.

from scipy import stats
import numpy as np

# Example: Testing a new checkout button
# Group A (control): 1000 users, 50 conversions (5.0%)
# Group B (variant): 1000 users, 65 conversions (6.5%)

n_a, conv_a = 1000, 50
n_b, conv_b = 1000, 65

rate_a = conv_a / n_a  # 0.050
rate_b = conv_b / n_b  # 0.065

# Two-proportion z-test
from statsmodels.stats.proportion import proportions_ztest

count = np.array([conv_b, conv_a])
nobs = np.array([n_b, n_a])

z_stat, p_value = proportions_ztest(count, nobs, alternative='larger')
print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")

# Effect size (relative lift)
lift = (rate_b - rate_a) / rate_a * 100
print(f"Lift: {lift:.1f}%")

# Sample size calculator (for planning)
from statsmodels.stats.power import NormalIndPower
analysis = NormalIndPower()
sample_size = analysis.solve_power(
    effect_size=0.1,  # expected standardized effect (Cohen's h for proportions)
    alpha=0.05,
    power=0.8,        # 80% chance of detecting the effect
    ratio=1            # equal group sizes
)
print(f"Required sample per group: {int(np.ceil(sample_size))}")
Tip: Always calculate sample size BEFORE running the test. Running a test for too short a time leads to noisy results. Too long wastes resources. Use a power analysis to determine the minimum sample size needed to detect your expected effect size.

Test Yourself

Your A/B test shows a p-value of 0.03. The conversion rate improved from 4.0% to 4.2%. Should you ship the change?

It's statistically significant (p < 0.05), but the practical significance is questionable. A 0.2 percentage point lift (5% relative) is tiny. Consider: How much revenue does that 0.2% represent? Is the implementation cost worth it? What's the confidence interval — could the true effect be near zero? Statistical significance alone isn't enough; always evaluate business impact.

What's the difference between a Type I and Type II error?

Type I (False Positive): Rejecting H0 when it's actually true. You conclude there's an effect when there isn't. Controlled by α (significance level, usually 0.05). Type II (False Negative): Failing to reject H0 when it's actually false. You miss a real effect. Controlled by statistical power (1 - β, usually aim for 0.80). Reducing one type of error increases the other unless you increase sample size.

When would you use Spearman correlation instead of Pearson?

Use Spearman when: 1) The relationship is monotonic but not linear (e.g., logarithmic). 2) Data has outliers (Spearman uses ranks, so outliers don't dominate). 3) Data is ordinal (ratings: poor/good/excellent). 4) Normality assumption is violated. Pearson is fine when data is roughly normal and the relationship appears linear.

You have 3 groups and want to compare their means. Why can't you just run 3 t-tests?

Running multiple t-tests inflates the Type I error rate. With 3 comparisons at α=0.05, the overall error rate is ~14.3% (1 - 0.95³), not 5%. Use ANOVA first to test if any group differs, then post-hoc tests (Tukey's HSD or Bonferroni correction) to find which specific pairs differ.
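A sketch of that workflow on simulated groups (the means and seed are arbitrary): ANOVA first, then Bonferroni-corrected pairwise t-tests. Recent SciPy versions also provide stats.tukey_hsd as a post-hoc alternative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Three simulated groups; only group c has a genuinely higher mean
a = rng.normal(50, 5, 100)
b = rng.normal(50, 5, 100)
c = rng.normal(55, 5, 100)

# Step 1: ANOVA asks "does ANY group differ?"
f_stat, p_anova = stats.f_oneway(a, b, c)
print(f"ANOVA p-value: {p_anova:.2e}")

# Step 2: Bonferroni-corrected pairwise t-tests (3 comparisons -> alpha/3)
pairs = {'a vs b': (a, b), 'a vs c': (a, c), 'b vs c': (b, c)}
alpha_corrected = 0.05 / len(pairs)
pvals = {}
for name, (g1, g2) in pairs.items():
    _, pvals[name] = stats.ttest_ind(g1, g2)
    flag = 'significant' if pvals[name] < alpha_corrected else 'not significant'
    print(f"{name}: p={pvals[name]:.4f} ({flag})")
```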

What does a 95% confidence interval actually mean?

If we repeated the sampling process infinitely and computed a CI each time, 95% of those intervals would contain the true population parameter. It does NOT mean "there's a 95% probability the true value is in this interval." Any specific interval either contains the true value or it doesn't — we just don't know which. The "95%" refers to the method's long-run reliability, not to any single interval.

Interview Questions

Explain the Central Limit Theorem and why it matters for data science.

The CLT states that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the underlying population distribution. This matters because: 1) It justifies using t-tests and z-tests even when data isn't normal. 2) It enables confidence intervals and hypothesis testing. 3) It explains why many natural phenomena appear normal (they're averages of many small effects). Rule of thumb: n ≥ 30 is usually sufficient for the CLT to apply.

Your A/B test has been running for a week and the p-value is 0.06. Should you run it longer?

It depends on whether you pre-committed to a sample size. If you did, stick with it — extending the test because p is close to 0.05 is called p-hacking and inflates false positive rates. If you hadn't hit the planned sample size yet, keep running. Best practice: decide the duration/sample size before the test starts using power analysis, and don't peek at results until it's done (or use sequential testing methods designed for interim analysis).

How would you handle multiple comparisons in a hypothesis testing scenario?

Multiple comparisons inflate false positive rates. Solutions: Bonferroni correction (divide α by number of tests — simple but conservative), Holm-Bonferroni (step-down, less conservative), Benjamini-Hochberg (controls False Discovery Rate instead of family-wise error, preferred in exploratory analysis), or ANOVA + post-hoc (for comparing group means). The choice depends on whether you prioritize controlling false positives vs. false negatives.
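The corrections above are each a single call in statsmodels. The p-values here are made up to show how the methods differ in strictness:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 5 independent tests
pvals = [0.001, 0.012, 0.035, 0.048, 0.20]

results = {}
for method in ['bonferroni', 'holm', 'fdr_bh']:
    reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    results[method] = int(reject.sum())
    print(f"{method:10s} rejects {results[method]} of {len(pvals)}")
```

Note that Bonferroni is the strictest (here it rejects only the smallest p-value), while Holm and Benjamini-Hochberg recover additional discoveries.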