Pandas Deep Dive
Master Pandas: read CSV/Excel/SQL, filter rows, select columns, group and aggregate, merge/join DataFrames, handle missing data, and reshape with pivot/melt. Includes a SQL-to-Pandas cheat sheet so you can translate what you already know.
Explain Like I'm 12
Think of a Pandas DataFrame as a super-powered spreadsheet that lives inside Python. Each column has a name and each row has a number. You can filter rows like "show me only students with grade A," group rows like "what's the average score per class," and merge two spreadsheets together just like matching stickers by their ID numbers.
Instead of clicking buttons in Excel, you write short commands. The cool part? Those commands work on a file with 10 rows or 10 million rows — no scrolling required.
Reading Data
Everything starts with getting data into a DataFrame. Pandas can read almost any format.
Tip: after loading, check df.head(), df.shape, and df.dtypes to catch problems early.
CSV files
import pandas as pd
# Basic read
df = pd.read_csv("sales.csv")
# Common options
df = pd.read_csv(
    "sales.csv",
    sep=",",                # delimiter (use "\t" for TSV)
    header=0,               # row number for column names
    index_col="id",         # use a column as the index
    usecols=["id", "product", "revenue"],  # only load these columns
    dtype={"id": str},      # force column types
    parse_dates=["date"],   # auto-parse date columns
    na_values=["N/A", ""],  # treat these as NaN
    nrows=1000,             # only read first 1000 rows (great for testing)
)
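You can exercise these options without a real file by reading from an in-memory buffer; a minimal sketch (the CSV contents here are made up):

```python
import io
import pandas as pd

# A tiny in-memory "file" standing in for sales.csv (hypothetical data)
csv_text = """id,product,revenue,date
1,Widget,100.5,2024-01-05
2,Gadget,N/A,2024-01-06
3,Widget,250.0,2024-01-07
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": str},     # keep IDs as strings, not ints
    parse_dates=["date"],  # "date" becomes datetime64
    na_values=["N/A"],     # "N/A" becomes NaN
)

print(df.dtypes)  # id is object, revenue is float64, date is datetime64[ns]
```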
Excel files
# Reads the first sheet by default
df = pd.read_excel("report.xlsx")
# Read a specific sheet
df = pd.read_excel("report.xlsx", sheet_name="Q1 Sales")
# Read all sheets into a dict of DataFrames
all_sheets = pd.read_excel("report.xlsx", sheet_name=None)
SQL databases
import sqlite3
conn = sqlite3.connect("company.db")
# Read an entire table
df = pd.read_sql("SELECT * FROM employees", conn)
# Read with a query (filter at the database level for speed)
df = pd.read_sql(
    "SELECT name, salary FROM employees WHERE department = 'Engineering'",
    conn,
)
conn.close()
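To try this without a database file, the same read_sql call works against an in-memory SQLite database; a self-contained sketch using a hypothetical employees table:

```python
import sqlite3
import pandas as pd

# In-memory database standing in for company.db (made-up schema and rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "Engineering", 120000), ("Bo", "Sales", 90000)],
)

# Filter at the database level, then hand pandas only what you need
df = pd.read_sql(
    "SELECT name, salary FROM employees WHERE department = 'Engineering'",
    conn,
)
conn.close()

print(df)  # one row: Ada, 120000.0
```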
Other readers worth knowing: read_json(), read_parquet(), read_html() (scrapes tables from web pages), and read_clipboard() (paste from Excel!).
Selecting Data
There are four main ways to grab data from a DataFrame. Each serves a different purpose.
Single column
# Returns a Series (1D)
df["name"]
# Dot notation (works only for simple column names without spaces)
df.name
Multiple columns
# Returns a DataFrame (2D) — note the double brackets
df[["name", "salary", "department"]]
loc — label-based selection
# Row by label, column by name
df.loc[0, "name"] # single value
df.loc[0:4, "name":"salary"] # slice (inclusive on both ends!)
df.loc[df["salary"] > 80000] # filter rows by condition
iloc — integer position-based selection
# Row by position, column by position
df.iloc[0, 1] # first row, second column
df.iloc[0:5, 0:3] # first 5 rows, first 3 columns (exclusive end)
df.iloc[-1] # last row
loc[0:4] includes row 4. iloc[0:4] excludes row 4. This trips up everyone at first. Think: loc = label (inclusive), iloc = index (Python-style exclusive).
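A tiny demonstration of the inclusive/exclusive difference, using a made-up frame whose labels happen to match positions:

```python
import pandas as pd

df = pd.DataFrame({"name": list("abcde")}, index=[0, 1, 2, 3, 4])

by_label = df.loc[0:2]      # labels 0, 1, AND 2 -> 3 rows
by_position = df.iloc[0:2]  # positions 0 and 1 only -> 2 rows

print(len(by_label), len(by_position))  # 3 2
```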
Filtering Rows
Filtering is how you ask "show me only the rows where..." It is the Pandas equivalent of SQL's WHERE clause.
Boolean indexing
# Single condition
high_earners = df[df["salary"] > 100000]
# Multiple conditions — use & (and), | (or), ~ (not)
# MUST wrap each condition in parentheses
senior_engineers = df[
    (df["department"] == "Engineering") &
    (df["years_exp"] >= 5)
]
# Negate: everyone NOT in Engineering
non_eng = df[~(df["department"] == "Engineering")]
query() — the readable alternative
# Same result, cleaner syntax
senior_engineers = df.query("department == 'Engineering' and years_exp >= 5")
# You can reference variables with @
min_salary = 80000
df.query("salary >= @min_salary")
isin() — match against a list
# Like SQL's IN clause
target_depts = ["Engineering", "Product", "Design"]
df[df["department"].isin(target_depts)]
Tip: use df["name"].str.contains("smith", case=False) for partial matches, or .str.startswith() / .str.endswith() for prefix/suffix matching.
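A quick sketch of the substring-match tip (the names are made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jane Smith", "John Smythe", "Ann Smith", "Bob Lee"]})

# Case-insensitive substring match (patterns are regex by default)
smiths = df[df["name"].str.contains("smith", case=False)]

print(list(smiths["name"]))  # ['Jane Smith', 'Ann Smith']
```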
Sorting
# Sort by one column (ascending by default)
df.sort_values("salary")
# Descending
df.sort_values("salary", ascending=False)
# Sort by multiple columns
df.sort_values(["department", "salary"], ascending=[True, False])
# Sort by index
df.sort_index()
NaN values sort to the end by default; pass na_position="first" to put them at the top.
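A short sketch of how NaN values behave under sorting (the data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [90000, np.nan, 120000]})

by_default = df.sort_values("salary")                      # NaN goes last
nan_first = df.sort_values("salary", na_position="first")  # NaN goes first

print(nan_first["salary"].tolist())
```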
GroupBy & Aggregation
GroupBy is the Pandas version of SQL's GROUP BY. The pattern is always: split (group rows) → apply (calculate) → combine (merge results).
Basic groupby + agg
# Average salary per department
df.groupby("department")["salary"].mean()
# Multiple aggregations
df.groupby("department")["salary"].agg(["mean", "median", "min", "max", "count"])
Multiple columns, different aggs
# Different function per column
df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    total_headcount=("id", "count"),
    max_experience=("years_exp", "max"),
)
transform vs apply
# transform: returns same-size Series (great for adding a column)
df["dept_avg"] = df.groupby("department")["salary"].transform("mean")
# Now each row has its department's average next to its actual salary
df["above_avg"] = df["salary"] > df["dept_avg"]
# apply: flexible, returns any shape (slower)
def top_earner(group):
    return group.nlargest(1, "salary")

df.groupby("department").apply(top_earner)
transform is vectorized and fast. apply loops over groups in Python and is much slower on large datasets. Prefer transform when possible.
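The transform pattern above, as a self-contained sketch with made-up salaries:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng", "Eng", "Sales"],
    "salary": [100, 140, 90],
})

# transform broadcasts each group's mean back to every row of that group
df["dept_avg"] = df.groupby("department")["salary"].transform("mean")
df["above_avg"] = df["salary"] > df["dept_avg"]

print(df["dept_avg"].tolist())  # [120.0, 120.0, 90.0]
```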
Merging & Joining
Combining DataFrames is one of the most common Pandas operations. If you know SQL JOINs, you already understand the concept.
merge — the SQL-style way
# INNER JOIN (default) — only matching rows
result = pd.merge(employees, departments, left_on="dept_id", right_on="id")
# LEFT JOIN — all employees, even without a department
result = pd.merge(employees, departments, left_on="dept_id", right_on="id", how="left")
# RIGHT JOIN
result = pd.merge(employees, departments, left_on="dept_id", right_on="id", how="right")
# FULL OUTER JOIN
result = pd.merge(employees, departments, left_on="dept_id", right_on="id", how="outer")
# JOIN on same-name column
result = pd.merge(orders, customers, on="customer_id")
join — index-based shortcut
# When the join key is the index
employees.set_index("dept_id").join(departments.set_index("id"), how="left")
concat — stacking DataFrames
# Stack vertically (like UNION ALL)
all_data = pd.concat([q1_sales, q2_sales, q3_sales, q4_sales])
# Stack horizontally (side by side)
combined = pd.concat([df1, df2], axis=1)
Tip: pass indicator=True to pd.merge() to get a _merge column showing "left_only", "right_only", or "both", which is great for debugging.
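A minimal sketch of the indicator=True trick (both tables are made up):

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 3]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bo"]})

# Left join keeps all orders; _merge flags which rows found a match
result = pd.merge(orders, customers, on="customer_id", how="left", indicator=True)

print(result["_merge"].astype(str).tolist())  # ['both', 'both', 'left_only']
```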
Handling Missing Data
Real-world data is messy. Pandas represents missing values as NaN (Not a Number). Here is your toolkit for dealing with them.
Detecting
# Check for nulls
df.isna() # Boolean DataFrame
df.isna().sum() # Count nulls per column
df.isna().sum().sum() # Total nulls in entire DataFrame
# Percentage of nulls
(df.isna().sum() / len(df) * 100).round(1)
Filling
# Fill with a constant
df["salary"].fillna(0)
# Fill with the column mean
df["salary"].fillna(df["salary"].mean())
# Forward fill (propagate last valid value)
df["price"].ffill()  # fillna(method="ffill") is deprecated; use .ffill() / .bfill()
# Interpolate (linear by default — great for time series)
df["temperature"].interpolate()
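The fill methods above in a runnable sketch (the series is made up); .ffill() is the modern spelling of the older fillna(method="ffill"):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled = s.ffill()        # carry the last valid value forward
interp = s.interpolate()  # linear interpolation between known points

print(filled.tolist())  # [1.0, 1.0, 1.0, 4.0]
print(interp.tolist())  # [1.0, 2.0, 3.0, 4.0]
```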
Dropping
# Drop rows with ANY null
df.dropna()
# Drop rows where a specific column is null
df.dropna(subset=["email"])
# Drop columns that are mostly null (> 50% missing)
threshold = int(len(df) * 0.5)  # thresh = minimum non-null values a column needs to survive
df.dropna(axis=1, thresh=threshold)
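A sketch showing how thresh works on a made-up frame (it counts the non-null values a column must have to be kept):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "good": [1, 2, 3, 4],
    "mostly_null": [1, np.nan, np.nan, np.nan],  # 75% missing
})

# Keep only columns with at least 50% non-null values
kept = df.dropna(axis=1, thresh=int(len(df) * 0.5))

print(list(kept.columns))  # ['good']
```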
Reshaping Data
Sometimes your data is in the wrong shape for your analysis. Pivot makes it wider; melt makes it longer.
pivot_table — long to wide
# Rows: department, Columns: year, Values: average salary
pivot = df.pivot_table(
    values="salary",
    index="department",
    columns="year",
    aggfunc="mean",
)
# Result: departments as rows, years as columns, cells = avg salary
melt — wide to long
# Turn columns into rows (opposite of pivot)
# Useful when each column is a year or category
long_df = pd.melt(
    wide_df,
    id_vars=["department"],       # keep these as-is
    value_vars=["2023", "2024"],  # unpivot these columns
    var_name="year",
    value_name="revenue",
)
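The melt call above, runnable end to end with a made-up wide table:

```python
import pandas as pd

wide_df = pd.DataFrame({
    "department": ["Eng", "Sales"],
    "2023": [500, 300],
    "2024": [550, 320],
})

# One row per (department, year) pair instead of one column per year
long_df = pd.melt(
    wide_df,
    id_vars=["department"],
    value_vars=["2023", "2024"],
    var_name="year",
    value_name="revenue",
)

print(len(long_df))  # 4 rows: 2 departments x 2 years
```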
stack / unstack
# stack: columns → rows (adds a level to the index)
stacked = df.stack()
# unstack: rows → columns (removes a level from the index)
unstacked = df.unstack()
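A small sketch of a stack/unstack round trip (the data is made up):

```python
import pandas as pd

df = pd.DataFrame(
    {"2023": [500, 300], "2024": [550, 320]},
    index=["Eng", "Sales"],
)

stacked = df.stack()            # MultiIndex Series: (dept, year) -> value
round_trip = stacked.unstack()  # back to the original wide shape

print(stacked.loc[("Eng", "2024")])  # 550
```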
Use pivot_table for summary reports (like Excel pivot tables). Use melt to prepare data for visualization libraries that expect "tidy" long-format data.
Method Chaining
Instead of saving intermediate variables, chain operations together. This reads top-to-bottom, like a recipe.
result = (
    df
    .query("department == 'Engineering'")
    .assign(
        salary_k=lambda x: x["salary"] / 1000,
        tenure=lambda x: 2026 - x["hire_year"],
    )
    .groupby("level")
    .agg(
        avg_salary_k=("salary_k", "mean"),
        headcount=("id", "count"),
    )
    .sort_values("avg_salary_k", ascending=False)
    .reset_index()
)
To debug a chain, end it early with .head() and inspect the intermediate result. You can also insert .pipe(lambda x: print(x.shape) or x) to log shapes mid-chain.
SQL-to-Pandas Cheat Sheet
If you know SQL, you know most of Pandas. This table maps the 15+ most common SQL operations to their Pandas equivalents.
| SQL | Pandas | Notes |
|---|---|---|
| SELECT * | df | The whole DataFrame |
| SELECT col1, col2 | df[["col1", "col2"]] | Double brackets return a DataFrame |
| SELECT DISTINCT col | df["col"].unique() | Returns a NumPy array |
| WHERE col = val | df[df["col"] == val] | Boolean indexing |
| WHERE col IN (a, b) | df[df["col"].isin([a, b])] | Pass a list |
| WHERE col LIKE '%text%' | df[df["col"].str.contains("text")] | Regex supported by default |
| WHERE col IS NULL | df[df["col"].isna()] | Use notna() for IS NOT NULL |
| ORDER BY col DESC | df.sort_values("col", ascending=False) | Ascending by default |
| LIMIT 10 | df.head(10) | Or df.iloc[:10] |
| GROUP BY col | df.groupby("col") | Chain with .agg() |
| HAVING COUNT(*) > 5 | .filter(lambda x: len(x) > 5) | After groupby |
| COUNT(*) | df.groupby("col").size() | Or len(df) for total |
| SUM(col) | df["col"].sum() | Also: mean(), min(), max() |
| INNER JOIN | pd.merge(a, b, on="key") | Default is inner |
| LEFT JOIN | pd.merge(a, b, on="key", how="left") | Also: "right", "outer" |
| UNION ALL | pd.concat([a, b]) | Use drop_duplicates() for UNION |
| CASE WHEN | np.where(cond, val1, val2) | Or df["col"].map() for multiple cases |
Performance Tips
What to do
# GOOD: vectorized (fast)
df["bonus"] = df["salary"] * 0.1
# BAD: iterrows (slow — loops in Python)
for idx, row in df.iterrows():
    df.at[idx, "bonus"] = row["salary"] * 0.1
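The two approaches give identical results; a runnable sketch confirming it on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"salary": [100.0, 200.0, 300.0]})

# Vectorized: one C-level multiply over the whole column
df["bonus_fast"] = df["salary"] * 0.1

# Row-by-row loop: same answer, dramatically slower on big frames
df["bonus_slow"] = 0.0
for idx, row in df.iterrows():
    df.at[idx, "bonus_slow"] = row["salary"] * 0.1

assert df["bonus_fast"].equals(df["bonus_slow"])
```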
More performance tips
- Read only what you need: use usecols and nrows in read_csv().
- Downcast types: df["id"] = df["id"].astype("int32") cuts memory in half vs int64.
- Use categoricals: df["status"] = df["status"].astype("category") for columns with few unique values.
- Filter before groupby: reduce rows first, then aggregate, just like SQL.
- Avoid apply when possible: use built-in methods (str, dt accessors) instead of .apply(lambda).
df.info(memory_usage="deep") shows actual RAM usage. If your DataFrame is bigger than available RAM, look into dask, polars, or process in chunks with read_csv(chunksize=10000).
Test Yourself
Q: What is the difference between df.loc[] and df.iloc[]?
loc selects by label (column names, index labels) and is inclusive on both ends. iloc selects by integer position and is exclusive on the end, like standard Python slicing.
Q: How would you load only the first 500 rows and columns "id", "name", "score" from a CSV file?
pd.read_csv("file.csv", usecols=["id", "name", "score"], nrows=500). Loading only what you need is faster and uses less memory.
Q: What does df.groupby("dept")["salary"].transform("mean") do that .agg("mean") does not?
transform returns a Series the same length as the original DataFrame (one mean value per row, matching its group). agg returns a collapsed result with one row per group. Use transform when you want to add a group-level value back as a column.
Q: You have two DataFrames, orders and customers, both with a "customer_id" column. How do you get all orders with customer details, including orders whose customer is missing from the customers table?
pd.merge(orders, customers, on="customer_id", how="left"). A left merge keeps all rows from the left DataFrame (orders) and fills NaN where no matching customer exists.
Q: Why should you avoid iterrows() for calculations on large DataFrames?
iterrows() loops row-by-row in Python, which is extremely slow. Vectorized Pandas/NumPy operations run in optimized C code and can be 100x+ faster. Instead of looping, use column arithmetic (df["a"] + df["b"]), np.where(), or .apply() as a last resort.
Interview Questions
Q: Explain the difference between a Pandas Series and a DataFrame.
A Series is a single labeled column (1D); a DataFrame is a 2D table whose columns are Series sharing a common index. You get a Series when you select one column: df["name"]. You get a DataFrame when you select multiple: df[["name", "age"]].
Q: You are given a DataFrame with 10 million rows and your Pandas code is slow. Walk through how you would optimize it.
(1) Profile to find the bottleneck. (2) Replace iterrows()/apply() with vectorized ops. (3) Downcast dtypes (int64→int32, object→category). (4) Filter early to reduce row count. (5) Use usecols to load only needed columns. (6) Consider chunked processing (read_csv(chunksize=N)). (7) If still too slow, switch to Polars or Dask for parallel processing.
Q: What is the Pandas equivalent of this SQL query: SELECT department, AVG(salary) FROM employees WHERE hire_year >= 2020 GROUP BY department HAVING AVG(salary) > 80000?
(
    df[df["hire_year"] >= 2020]
    .groupby("department")["salary"]
    .mean()
    .loc[lambda x: x > 80000]
)
Filter first (WHERE), then group (GROUP BY), then filter groups (HAVING). The .loc[lambda] pattern replaces HAVING.
Q: How does pd.merge() handle duplicate keys? What problems can this cause?
Duplicate keys on either side produce a many-to-many (Cartesian) match: every pair of matching rows becomes an output row, so the result can silently grow much larger than either input. Check len(result) after a merge and use validate="one_to_many" or validate="one_to_one" to catch unexpected duplicates.