Data Wrangling & Preprocessing

TL;DR

Data wrangling is the process of cleaning and transforming raw data into an analysis-ready format. Handle missing values (impute or drop), fix data types, remove duplicates, detect outliers (IQR or Z-score), and reshape data (melt/pivot). pandas is your primary tool. Expect to spend 60-80% of project time here.

Explain Like I'm 12

Imagine you just got a giant box of LEGO pieces, but they're all mixed up, some are broken, and some aren't even LEGO. Before you can build anything cool, you need to: sort them (data types), throw away broken ones (remove bad data), find missing pieces (handle nulls), and organize by color and size (reshape). That's data wrangling — getting your data ready before the fun part starts.

The Wrangling Workflow

Data wrangling follows a predictable pattern. Start by understanding what you have, then fix issues systematically.

Data wrangling workflow: inspect, clean missing values, fix types, remove duplicates, handle outliers, reshape

Step 1: Inspect Your Data

Before fixing anything, understand what you're working with. These five commands tell you almost everything you need to know.

import pandas as pd

df = pd.read_csv('dataset.csv')

# Shape: how many rows and columns?
print(df.shape)              # (10000, 15)

# Column types and null counts (info() prints directly, so no print() wrapper is needed)
df.info()

# Statistical summary of numeric columns
print(df.describe())

# First few rows
print(df.head())

# Unique values per column (spot unexpected categories)
for col in df.select_dtypes(include='object'):
    print(f"{col}: {df[col].nunique()} unique — {df[col].unique()[:5]}")
Tip: Run df.info() first every time. It shows data types and null counts in one view. If a column that should be numeric shows as object, you have dirty values in it (like "$" or "N/A" strings).

Step 2: Handle Missing Values

Missing data is the most common issue. Your strategy depends on how much is missing and why it's missing.

| Strategy | When to Use | Code |
| --- | --- | --- |
| Drop rows | <5% missing, random | df.dropna(subset=['col']) |
| Drop column | >50% missing | df.drop('col', axis=1) |
| Fill with median | Numeric, has outliers | df['col'].fillna(df['col'].median()) |
| Fill with mean | Numeric, normal distribution | df['col'].fillna(df['col'].mean()) |
| Fill with mode | Categorical | df['col'].fillna(df['col'].mode()[0]) |
| Forward fill | Time series | df['col'].ffill() |
| Flag + fill | Missingness is informative | Create a col_was_missing boolean column |
# Visualize missing data patterns
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])

# Strategy: create a "was_missing" flag, then impute
df['income_missing'] = df['income'].isnull().astype(int)
df['income'] = df['income'].fillna(df['income'].median())

# For categorical columns, fill with the mode
df['city'] = df['city'].fillna(df['city'].mode()[0])

# For time series, use forward fill
# (assign back instead of inplace=True — inplace on a column selection is
# deprecated chained assignment and silently fails under Copy-on-Write)
df['temperature'] = df['temperature'].ffill()
Warning: Never impute the target variable (the column you're trying to predict). If the target is missing, drop the row — imputing it would be making up the answer.
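For example, with a hypothetical target column, the safe pattern is to drop rather than fill:

```python
import pandas as pd

# Toy dataset with one missing target value (names invented for illustration)
df = pd.DataFrame({'feature': [1.0, 2.0, 3.0],
                   'target':  [10.0, None, 30.0]})

# Drop rows where the target is missing; never impute it
df = df.dropna(subset=['target'])
print(len(df))  # 2
```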

Step 3: Fix Data Types

Wrong types silently break calculations. A "price" column stored as a string can't be summed or averaged.

# String prices → numeric (remove $, commas)
df['price'] = df['price'].str.replace('[$,]', '', regex=True).astype(float)

# String dates → datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Categories with few unique values → category type (saves memory)
df['status'] = df['status'].astype('category')

# Boolean strings → actual booleans
df['is_active'] = df['is_active'].map({'yes': True, 'no': False, 'Yes': True, 'No': False})
Tip: Use pd.to_numeric(df['col'], errors='coerce') to convert columns that have mixed types. The errors='coerce' turns unparseable values into NaN instead of throwing an error, so you can find and fix them.
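A quick sketch of that coerce-then-inspect pattern, with invented dirty values:

```python
import pandas as pd

df = pd.DataFrame({'price': ['$1,200', '950', 'N/A', '3000']})

# Strip symbols, then coerce: unparseable strings become NaN instead of raising
cleaned = pd.to_numeric(df['price'].str.replace('[$,]', '', regex=True),
                        errors='coerce')

# Inspect the original values that failed to parse
print(df['price'][cleaned.isna()].tolist())  # ['N/A']
```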

Step 4: Remove Duplicates

Duplicates inflate your dataset and bias your model. Check for exact duplicates and near-duplicates (same entity, slightly different data).

# Exact duplicates
print(f"Duplicates: {df.duplicated().sum()}")
df.drop_duplicates(inplace=True)

# Duplicates on specific columns (e.g., same customer, same date)
df.drop_duplicates(subset=['customer_id', 'order_date'], keep='last', inplace=True)

# Near-duplicates: check for similar but not identical records
df[df.duplicated(subset=['name', 'email'], keep=False)].sort_values('name')
Info: The keep parameter controls which duplicate to keep: 'first' (default), 'last', or False (drop all duplicates).
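A tiny sketch with made-up data to show all three options side by side:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'value': ['a', 'b', 'c']})

first = df.drop_duplicates(subset=['id'], keep='first')   # keeps 'a' and 'c'
last = df.drop_duplicates(subset=['id'], keep='last')     # keeps 'b' and 'c'
neither = df.drop_duplicates(subset=['id'], keep=False)   # keeps only 'c'

print(first['value'].tolist())    # ['a', 'c']
print(last['value'].tolist())     # ['b', 'c']
print(neither['value'].tolist())  # ['c']
```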

Step 5: Handle Outliers

Outliers are extreme values that can skew your analysis and mislead your model. Detect them with IQR or Z-scores, then decide: keep, cap, or remove.

import numpy as np

# Method 1: IQR (Interquartile Range)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]
print(f"Outliers found: {len(outliers)}")

# Method 2: Z-score (assumes normal distribution)
from scipy import stats
z_scores = np.abs(stats.zscore(df['salary'].dropna()))
outlier_mask = z_scores > 3  # beyond 3 standard deviations

# Strategy: Cap (winsorize) instead of removing
df['salary'] = df['salary'].clip(lower=lower, upper=upper)
| Strategy | When to Use | Pros | Cons |
| --- | --- | --- | --- |
| Remove | Data entry errors | Clean dataset | Lose information |
| Cap (winsorize) | Real but extreme values | Keeps all rows | Distorts extremes |
| Keep | Important signal (fraud) | Preserves truth | May skew models |
| Log transform | Skewed distributions | Reduces impact | Harder to interpret |
Warning: Not all outliers are errors! A $10M sale might be unusual but real. Always check the business context before removing outliers.
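The log-transform strategy from the table above can be sketched like this (column values invented); np.log1p computes log(1 + x), so zeros stay finite:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'salary': [30_000, 45_000, 52_000, 2_000_000]})

# The extreme value is compressed rather than dropped:
# the raw max/min ratio is ~67x, but only ~1.4x after the transform
df['salary_log'] = np.log1p(df['salary'])
```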

Step 6: Reshape Data

Sometimes data arrives in the wrong shape. Use melt (wide → long) and pivot (long → wide) to restructure it for your analysis.

# Wide → Long (melt)
# Before: columns = ['name', 'jan_sales', 'feb_sales', 'mar_sales']
df_long = df.melt(
    id_vars=['name'],
    value_vars=['jan_sales', 'feb_sales', 'mar_sales'],
    var_name='month',
    value_name='sales'
)

# Long → Wide (pivot)
df_wide = df_long.pivot_table(
    index='name', columns='month', values='sales', aggfunc='sum'
).reset_index()

# GroupBy aggregation
summary = df.groupby('category').agg(
    total_sales=('sales', 'sum'),
    avg_price=('price', 'mean'),
    num_orders=('order_id', 'count')
).reset_index()
Tip: Use pivot_table instead of pivot when you have duplicate index/column combinations. pivot_table lets you specify an aggfunc to handle them.
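A minimal illustration with invented data: pivot raises on the duplicate (name, month) pair, while pivot_table aggregates it.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B'],
                   'month': ['jan', 'jan', 'jan'],
                   'sales': [10, 20, 5]})

try:
    df.pivot(index='name', columns='month', values='sales')
except ValueError as err:
    print('pivot failed:', err)  # duplicate entries for ('A', 'jan')

# pivot_table resolves the duplicates with aggfunc
wide = df.pivot_table(index='name', columns='month', values='sales', aggfunc='sum')
print(wide.loc['A', 'jan'])  # 30
```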

Test Yourself

A column called "age" has 15% missing values. What strategy would you use and why?

With 15% missing, dropping rows would lose too much data. Best approach: impute with the median (robust to outliers in age data). If age is a strong predictor, also create an age_was_missing flag column — the fact that age is missing might itself be informative (e.g., users who skip optional fields).

What's the difference between df.drop_duplicates() and df.drop_duplicates(subset=['col'])?

df.drop_duplicates() removes rows where ALL columns are identical. df.drop_duplicates(subset=['col']) removes rows where only the specified column(s) are identical, keeping the first occurrence by default. Use subset when you want to deduplicate by a business key (like customer ID) rather than requiring an exact match on every field.

When should you use IQR vs Z-score for outlier detection?

IQR method: Works on any distribution (doesn't assume normality). Good default choice. Uses Q1 - 1.5*IQR and Q3 + 1.5*IQR as bounds. Z-score: Assumes data is roughly normal. Uses standard deviations from the mean (typically |z| > 3). Better for normally distributed data. If the data is heavily skewed, IQR is more reliable.

A "price" column reads as object type instead of float64. What's likely wrong and how do you fix it?

The column likely contains non-numeric characters like "$", commas, or text entries like "N/A". Fix: df['price'] = df['price'].str.replace('[$,]', '', regex=True) to strip symbols, then pd.to_numeric(df['price'], errors='coerce') to convert. The errors='coerce' turns unparseable values into NaN so you can find and handle them separately.

When would you use melt() vs pivot_table()?

melt() converts wide to long format — use when columns represent categories (like months) that should be rows. Example: columns jan_sales, feb_sales → one month column + one sales column. pivot_table() converts long to wide — use when you want to see values spread across columns. Example: one row per product with sales in separate month columns.

Interview Questions

How do you handle a column with 40% missing values?

Options: 1) If the column isn't important for your model, drop it. 2) If it's important, impute with median/mode and add a was_missing indicator column. 3) Use advanced imputation (KNN imputer or iterative imputer from scikit-learn) that predicts missing values from other columns. 4) Use algorithms that handle nulls natively (XGBoost, LightGBM). Always analyze why data is missing (MCAR, MAR, or MNAR) before choosing a strategy.
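A sketch of the KNN option using scikit-learn's KNNImputer (toy array; in a real project, fit it on the training split only):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Each NaN is replaced by the mean of that column across the k nearest
# rows, with distance computed on the non-missing features
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # 4.0 — mean of its two nearest neighbours (2.0 and 6.0)
```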

What's the difference between MCAR, MAR, and MNAR?

MCAR (Missing Completely at Random): Missingness is unrelated to any variable (rare). Safe to drop rows. MAR (Missing at Random): Missingness depends on other observed variables (e.g., younger people skip income questions). Imputation works well. MNAR (Missing Not at Random): Missingness depends on the missing value itself (e.g., high-income people don't report income). Hardest to handle — may need domain-specific approaches or sensitivity analysis.

How would you detect and handle data leakage during preprocessing?

Data leakage occurs when information from the test set "leaks" into training. Common during preprocessing: if you compute the mean on the entire dataset before splitting, test data influences training imputation. Fix: Always split first, then fit transformations only on training data. Use scikit-learn Pipeline with fit_transform on train and transform on test. Signs of leakage: suspiciously high accuracy that drops in production.
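A minimal leakage-safe sketch with invented data: split first, then let the Pipeline compute imputation and scaling statistics on the training fold only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
X[::10, 0] = np.nan                      # sprinkle missing values into one feature
y = (X[:, 1] > 0).astype(int)            # synthetic label

# Split BEFORE computing any statistics
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the imputer and scaler on X_train only,
# then reuses those training statistics to transform X_test
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
```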