Data Wrangling & Preprocessing

TL;DR

Data wrangling is the process of cleaning and transforming raw data into an analysis-ready format. Handle missing values (impute or drop), fix data types, remove duplicates, detect outliers (IQR or Z-score), and reshape data (melt/pivot). pandas is your primary tool. Expect to spend 60-80% of project time here.

Explain Like I'm 12

Imagine you just got a giant box of LEGO pieces, but they're all mixed up, some are broken, and some aren't even LEGO. Before you can build anything cool, you need to: sort them (data types), throw away broken ones (remove bad data), find missing pieces (handle nulls), and organize by color and size (reshape). That's data wrangling — getting your data ready before the fun part starts.

The Wrangling Workflow

Data wrangling follows a predictable pattern. Start by understanding what you have, then fix issues systematically.

Data wrangling workflow: inspect, clean missing values, fix types, remove duplicates, handle outliers, reshape

Step 1: Inspect Your Data

Before fixing anything, understand what you're working with. These five commands tell you almost everything you need to know.

import pandas as pd

df = pd.read_csv('dataset.csv')

# Shape: how many rows and columns?
print(df.shape)              # (10000, 15)

# Column types and null counts (info() prints directly, so no print() wrapper is needed)
df.info()

# Statistical summary of numeric columns
print(df.describe())

# First few rows
print(df.head())

# Unique values per column (spot unexpected categories)
for col in df.select_dtypes(include='object'):
    print(f"{col}: {df[col].nunique()} unique — {df[col].unique()[:5]}")
Tip: Run df.info() first every time. It shows data types and null counts in one view. If a column that should be numeric shows as object, you have dirty values in it (like "$" or "N/A" strings).

Step 2: Handle Missing Values

Missing data is the most common issue. Your strategy depends on how much is missing and why it's missing.

| Strategy | When to Use | Code |
| --- | --- | --- |
| Drop rows | <5% missing, random | df.dropna(subset=['col']) |
| Drop column | >50% missing | df.drop('col', axis=1) |
| Fill with median | Numeric, has outliers | df['col'].fillna(df['col'].median()) |
| Fill with mean | Numeric, normal distribution | df['col'].fillna(df['col'].mean()) |
| Fill with mode | Categorical | df['col'].fillna(df['col'].mode()[0]) |
| Forward fill | Time series | df['col'].ffill() |
| Flag + fill | Missingness is informative | Create a col_was_missing boolean column |
# Visualize missing data patterns
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])

# Strategy: create a "was_missing" flag, then impute
df['income_missing'] = df['income'].isnull().astype(int)
df['income'] = df['income'].fillna(df['income'].median())

# For categorical columns, fill with the mode
df['city'] = df['city'].fillna(df['city'].mode()[0])

# For time series, use forward fill
# (assign back instead of inplace=True — inplace on a column selection is
# deprecated chained assignment and silently fails under Copy-on-Write)
df['temperature'] = df['temperature'].ffill()
Warning: Never impute the target variable (the column you're trying to predict). If the target is missing, drop the row — imputing it would be making up the answer.
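For example, with a hypothetical target column, the safe pattern is to drop rather than fill:

```python
import pandas as pd

# Toy dataset with one missing target value (names invented for illustration)
df = pd.DataFrame({'feature': [1.0, 2.0, 3.0],
                   'target':  [10.0, None, 30.0]})

# Drop rows where the target is missing; never impute it
df = df.dropna(subset=['target'])
print(len(df))  # 2
```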

Step 3: Fix Data Types

Wrong types silently break calculations. A "price" column stored as a string can't be summed or averaged.

# String prices → numeric (remove $, commas)
df['price'] = df['price'].str.replace('[$,]', '', regex=True).astype(float)

# String dates → datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Categories with few unique values → category type (saves memory)
df['status'] = df['status'].astype('category')

# Boolean strings → actual booleans
df['is_active'] = df['is_active'].map({'yes': True, 'no': False, 'Yes': True, 'No': False})
Tip: Use pd.to_numeric(df['col'], errors='coerce') to convert columns that have mixed types. The errors='coerce' turns unparseable values into NaN instead of throwing an error, so you can find and fix them.
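A quick sketch of that coerce-then-inspect pattern, with invented dirty values:

```python
import pandas as pd

df = pd.DataFrame({'price': ['$1,200', '950', 'N/A', '3000']})

# Strip symbols, then coerce: unparseable strings become NaN instead of raising
cleaned = pd.to_numeric(df['price'].str.replace('[$,]', '', regex=True),
                        errors='coerce')

# Inspect the original values that failed to parse
print(df['price'][cleaned.isna()].tolist())  # ['N/A']
```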

Step 4: Remove Duplicates

Duplicates inflate your dataset and bias your model. Check for exact duplicates and near-duplicates (same entity, slightly different data).

# Exact duplicates
print(f"Duplicates: {df.duplicated().sum()}")
df.drop_duplicates(inplace=True)

# Duplicates on specific columns (e.g., same customer, same date)
df.drop_duplicates(subset=['customer_id', 'order_date'], keep='last', inplace=True)

# Near-duplicates: check for similar but not identical records
df[df.duplicated(subset=['name', 'email'], keep=False)].sort_values('name')
Info: The keep parameter controls which duplicate to keep: 'first' (default), 'last', or False (drop all duplicates).
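A tiny sketch with made-up data to show all three options side by side:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'value': ['a', 'b', 'c']})

first = df.drop_duplicates(subset=['id'], keep='first')   # keeps 'a' and 'c'
last = df.drop_duplicates(subset=['id'], keep='last')     # keeps 'b' and 'c'
neither = df.drop_duplicates(subset=['id'], keep=False)   # keeps only 'c'

print(first['value'].tolist())    # ['a', 'c']
print(last['value'].tolist())     # ['b', 'c']
print(neither['value'].tolist())  # ['c']
```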

Step 5: Handle Outliers

Outliers are extreme values that can skew your analysis and mislead your model. Detect them with IQR or Z-scores, then decide: keep, cap, or remove.

import numpy as np

# Method 1: IQR (Interquartile Range)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]
print(f"Outliers found: {len(outliers)}")

# Method 2: Z-score (assumes normal distribution)
from scipy import stats
z_scores = np.abs(stats.zscore(df['salary'].dropna()))
outlier_mask = z_scores > 3  # beyond 3 standard deviations

# Strategy: Cap (winsorize) instead of removing
df['salary'] = df['salary'].clip(lower=lower, upper=upper)
| Strategy | When to Use | Pros | Cons |
| --- | --- | --- | --- |
| Remove | Data entry errors | Clean dataset | Lose information |
| Cap (winsorize) | Real but extreme values | Keeps all rows | Distorts extremes |
| Keep | Important signal (fraud) | Preserves truth | May skew models |
| Log transform | Skewed distributions | Reduces impact | Harder to interpret |
Warning: Not all outliers are errors! A $10M sale might be unusual but real. Always check the business context before removing outliers.
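The log-transform strategy from the table above can be sketched like this (column values invented); np.log1p computes log(1 + x), so zeros stay finite:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'salary': [30_000, 45_000, 52_000, 2_000_000]})

# The extreme value is compressed rather than dropped:
# the raw max/min ratio is ~67x, but only ~1.4x after the transform
df['salary_log'] = np.log1p(df['salary'])
```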

Step 6: Reshape Data

Sometimes data arrives in the wrong shape. Use melt (wide → long) and pivot (long → wide) to restructure it for your analysis.

# Wide → Long (melt)
# Before: columns = ['name', 'jan_sales', 'feb_sales', 'mar_sales']
df_long = df.melt(
    id_vars=['name'],
    value_vars=['jan_sales', 'feb_sales', 'mar_sales'],
    var_name='month',
    value_name='sales'
)

# Long → Wide (pivot)
df_wide = df_long.pivot_table(
    index='name', columns='month', values='sales', aggfunc='sum'
).reset_index()

# GroupBy aggregation
summary = df.groupby('category').agg(
    total_sales=('sales', 'sum'),
    avg_price=('price', 'mean'),
    num_orders=('order_id', 'count')
).reset_index()
Tip: Use pivot_table instead of pivot when you have duplicate index/column combinations. pivot_table lets you specify an aggfunc to handle them.
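A minimal illustration with invented data: pivot raises on the duplicate (name, month) pair, while pivot_table aggregates it.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B'],
                   'month': ['jan', 'jan', 'jan'],
                   'sales': [10, 20, 5]})

try:
    df.pivot(index='name', columns='month', values='sales')
except ValueError as err:
    print('pivot failed:', err)  # duplicate entries for ('A', 'jan')

# pivot_table resolves the duplicates with aggfunc
wide = df.pivot_table(index='name', columns='month', values='sales', aggfunc='sum')
print(wide.loc['A', 'jan'])  # 30
```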

Test Yourself

A column called "age" has 15% missing values. What strategy would you use and why?

With 15% missing, dropping rows would lose too much data. Best approach: impute with the median (robust to outliers in age data). If age is a strong predictor, also create an age_was_missing flag column — the fact that age is missing might itself be informative (e.g., users who skip optional fields).

What's the difference between df.drop_duplicates() and df.drop_duplicates(subset=['col'])?

df.drop_duplicates() removes rows where ALL columns are identical. df.drop_duplicates(subset=['col']) removes rows where only the specified column(s) are identical, keeping the first occurrence by default. Use subset when you want to deduplicate by a business key (like customer ID) rather than requiring an exact match on every field.

When should you use IQR vs Z-score for outlier detection?

IQR method: Works on any distribution (doesn't assume normality). Good default choice. Uses Q1 - 1.5*IQR and Q3 + 1.5*IQR as bounds. Z-score: Assumes data is roughly normal. Uses standard deviations from the mean (typically |z| > 3). Better for normally distributed data. If the data is heavily skewed, IQR is more reliable.

A "price" column reads as object type instead of float64. What's likely wrong and how do you fix it?

The column likely contains non-numeric characters like "$", commas, or text entries like "N/A". Fix: df['price'] = df['price'].str.replace('[$,]', '', regex=True) to strip symbols, then pd.to_numeric(df['price'], errors='coerce') to convert. The errors='coerce' turns unparseable values into NaN so you can find and handle them separately.

When would you use melt() vs pivot_table()?

melt() converts wide to long format — use when columns represent categories (like months) that should be rows. Example: columns jan_sales, feb_sales → one month column + one sales column. pivot_table() converts long to wide — use when you want to see values spread across columns. Example: one row per product with sales in separate month columns.

Interview Questions

How do you handle a column with 40% missing values?

Options: 1) If the column isn't important for your model, drop it. 2) If it's important, impute with median/mode and add a was_missing indicator column. 3) Use advanced imputation (KNN imputer or iterative imputer from scikit-learn) that predicts missing values from other columns. 4) Use algorithms that handle nulls natively (XGBoost, LightGBM). Always analyze why data is missing (MCAR, MAR, or MNAR) before choosing a strategy.
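A sketch of the KNN option using scikit-learn's KNNImputer (toy array; in a real project, fit it on the training split only):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Each NaN is replaced by the mean of that column across the k nearest
# rows, with distance computed on the non-missing features
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # 4.0 — mean of its two nearest neighbours (2.0 and 6.0)
```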

What's the difference between MCAR, MAR, and MNAR?

MCAR (Missing Completely at Random): Missingness is unrelated to any variable (rare). Safe to drop rows. MAR (Missing at Random): Missingness depends on other observed variables (e.g., younger people skip income questions). Imputation works well. MNAR (Missing Not at Random): Missingness depends on the missing value itself (e.g., high-income people don't report income). Hardest to handle — may need domain-specific approaches or sensitivity analysis.

How would you detect and handle data leakage during preprocessing?

Data leakage occurs when information from the test set "leaks" into training. Common during preprocessing: if you compute the mean on the entire dataset before splitting, test data influences training imputation. Fix: Always split first, then fit transformations only on training data. Use scikit-learn Pipeline with fit_transform on train and transform on test. Signs of leakage: suspiciously high accuracy that drops in production.
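A minimal leakage-safe sketch with invented data: split first, then let the Pipeline compute imputation and scaling statistics on the training fold only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
X[::10, 0] = np.nan                      # sprinkle missing values into one feature
y = (X[:, 1] > 0).astype(int)            # synthetic label

# Split BEFORE computing any statistics
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the imputer and scaler on X_train only,
# then reuses those training statistics to transform X_test
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
```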