Pandas Deep Dive
Master Pandas: read CSV/Excel/SQL, filter rows, select columns, group and aggregate, merge/join DataFrames, handle missing data, and reshape with pivot/melt. Includes a SQL-to-Pandas cheat sheet so you can translate what you already know.
Explain Like I'm 12
Think of a Pandas DataFrame as a super-powered spreadsheet that lives inside Python. Each column has a name and each row has a number. You can filter rows like "show me only students with grade A," group rows like "what's the average score per class," and merge two spreadsheets together just like matching stickers by their ID numbers.
Instead of clicking buttons in Excel, you write short commands. The cool part? Those commands work on a file with 10 rows or 10 million rows — no scrolling required.
Reading Data
Everything starts with getting data into a DataFrame. Pandas can read almost any format.
Tip: after loading, check df.head(), df.shape, and df.dtypes to catch problems early.
CSV files
import pandas as pd
# Basic read
df = pd.read_csv("sales.csv")
# Common options
df = pd.read_csv(
    "sales.csv",
    sep=",",                # delimiter (use "\t" for TSV)
    header=0,               # row number for column names
    index_col="id",         # use a column as the index
    usecols=["id", "product", "revenue"],  # only load these columns
    dtype={"id": str},      # force column types
    parse_dates=["date"],   # auto-parse date columns
    na_values=["N/A", ""],  # treat these as NaN
    nrows=1000,             # only read first 1000 rows (great for testing)
)
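You can exercise these options without a real file by reading from an in-memory buffer; a minimal sketch (the CSV contents here are made up):

```python
import io
import pandas as pd

# A tiny in-memory "file" standing in for sales.csv (hypothetical data)
csv_text = """id,product,revenue,date
1,Widget,100.5,2024-01-05
2,Gadget,N/A,2024-01-06
3,Widget,250.0,2024-01-07
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": str},     # keep IDs as strings, not ints
    parse_dates=["date"],  # "date" becomes datetime64
    na_values=["N/A"],     # "N/A" becomes NaN
)

print(df.dtypes)  # id is object, revenue is float64, date is datetime64[ns]
```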
Excel files
# Reads the first sheet by default
df = pd.read_excel("report.xlsx")
# Read a specific sheet
df = pd.read_excel("report.xlsx", sheet_name="Q1 Sales")
# Read all sheets into a dict of DataFrames
all_sheets = pd.read_excel("report.xlsx", sheet_name=None)
SQL databases
import sqlite3
conn = sqlite3.connect("company.db")
# Read an entire table
df = pd.read_sql("SELECT * FROM employees", conn)
# Read with a query (filter at the database level for speed)
df = pd.read_sql(
    "SELECT name, salary FROM employees WHERE department = 'Engineering'",
    conn,
)
conn.close()
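To try this without a database file, the same read_sql call works against an in-memory SQLite database; a self-contained sketch using a hypothetical employees table:

```python
import sqlite3
import pandas as pd

# In-memory database standing in for company.db (made-up schema and rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "Engineering", 120000), ("Bo", "Sales", 90000)],
)

# Filter at the database level, then hand pandas only what you need
df = pd.read_sql(
    "SELECT name, salary FROM employees WHERE department = 'Engineering'",
    conn,
)
conn.close()

print(df)  # one row: Ada, 120000.0
```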
Other readers worth knowing: read_json(), read_parquet(), read_html() (scrapes tables from web pages), and read_clipboard() (paste from Excel!).
Selecting Data
There are four main ways to grab data from a DataFrame. Each serves a different purpose.
Single column
# Returns a Series (1D)
df["name"]
# Dot notation (works only for simple column names without spaces)
df.name
Multiple columns
# Returns a DataFrame (2D) — note the double brackets
df[["name", "salary", "department"]]
loc — label-based selection
# Row by label, column by name
df.loc[0, "name"] # single value
df.loc[0:4, "name":"salary"] # slice (inclusive on both ends!)
df.loc[df["salary"] > 80000] # filter rows by condition
iloc — integer position-based selection
# Row by position, column by position
df.iloc[0, 1] # first row, second column
df.iloc[0:5, 0:3] # first 5 rows, first 3 columns (exclusive end)
df.iloc[-1] # last row
loc[0:4] includes row 4. iloc[0:4] excludes row 4. This trips up everyone at first. Think: loc = label (inclusive), iloc = index (Python-style exclusive).
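A tiny demonstration of the inclusive/exclusive difference, using a made-up frame whose labels happen to match positions:

```python
import pandas as pd

df = pd.DataFrame({"name": list("abcde")}, index=[0, 1, 2, 3, 4])

by_label = df.loc[0:2]      # labels 0, 1, AND 2 -> 3 rows
by_position = df.iloc[0:2]  # positions 0 and 1 only -> 2 rows

print(len(by_label), len(by_position))  # 3 2
```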
Filtering Rows
Filtering is how you ask "show me only the rows where..." It is the Pandas equivalent of SQL's WHERE clause.
Boolean indexing
# Single condition
high_earners = df[df["salary"] > 100000]
# Multiple conditions — use & (and), | (or), ~ (not)
# MUST wrap each condition in parentheses
senior_engineers = df[
    (df["department"] == "Engineering") &
    (df["years_exp"] >= 5)
]
# Negate: everyone NOT in Engineering
non_eng = df[~(df["department"] == "Engineering")]
query() — the readable alternative
# Same result, cleaner syntax
senior_engineers = df.query("department == 'Engineering' and years_exp >= 5")
# You can reference variables with @
min_salary = 80000
df.query("salary >= @min_salary")
isin() — match against a list
# Like SQL's IN clause
target_depts = ["Engineering", "Product", "Design"]
df[df["department"].isin(target_depts)]
Tip: use df["name"].str.contains("smith", case=False) for partial matches, or .str.startswith() / .str.endswith() for prefix/suffix matching.
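A quick sketch of the substring-match tip (the names are made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jane Smith", "John Smythe", "Ann Smith", "Bob Lee"]})

# Case-insensitive substring match (patterns are regex by default)
smiths = df[df["name"].str.contains("smith", case=False)]

print(list(smiths["name"]))  # ['Jane Smith', 'Ann Smith']
```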
Sorting
# Sort by one column (ascending by default)
df.sort_values("salary")
# Descending
df.sort_values("salary", ascending=False)
# Sort by multiple columns
df.sort_values(["department", "salary"], ascending=[True, False])
# Sort by index
df.sort_index()
NaN values sort to the end by default; pass na_position="first" to put them at the top.
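A short sketch of how NaN values behave under sorting (the data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [90000, np.nan, 120000]})

by_default = df.sort_values("salary")                      # NaN goes last
nan_first = df.sort_values("salary", na_position="first")  # NaN goes first

print(nan_first["salary"].tolist())
```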
GroupBy & Aggregation
GroupBy is the Pandas version of SQL's GROUP BY. The pattern is always: split (group rows) → apply (calculate) → combine (merge results).
Basic groupby + agg
# Average salary per department
df.groupby("department")["salary"].mean()
# Multiple aggregations
df.groupby("department")["salary"].agg(["mean", "median", "min", "max", "count"])
Multiple columns, different aggs
# Different function per column
df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    total_headcount=("id", "count"),
    max_experience=("years_exp", "max"),
)
transform vs apply
# transform: returns same-size Series (great for adding a column)
df["dept_avg"] = df.groupby("department")["salary"].transform("mean")
# Now each row has its department's average next to its actual salary
df["above_avg"] = df["salary"] > df["dept_avg"]
# apply: flexible, returns any shape (slower)
def top_earner(group):
    return group.nlargest(1, "salary")

df.groupby("department").apply(top_earner)
transform is vectorized and fast. apply loops over groups in Python and is much slower on large datasets. Prefer transform when possible.
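The transform pattern above, as a self-contained sketch with made-up salaries:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng", "Eng", "Sales"],
    "salary": [100, 140, 90],
})

# transform broadcasts each group's mean back to every row of that group
df["dept_avg"] = df.groupby("department")["salary"].transform("mean")
df["above_avg"] = df["salary"] > df["dept_avg"]

print(df["dept_avg"].tolist())  # [120.0, 120.0, 90.0]
```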
Merging & Joining
Combining DataFrames is one of the most common Pandas operations. If you know SQL JOINs, you already understand the concept.
merge — the SQL-style way
# INNER JOIN (default) — only matching rows
result = pd.merge(employees, departments, left_on="dept_id", right_on="id")
# LEFT JOIN — all employees, even without a department
result = pd.merge(employees, departments, left_on="dept_id", right_on="id", how="left")
# RIGHT JOIN
result = pd.merge(employees, departments, left_on="dept_id", right_on="id", how="right")
# FULL OUTER JOIN
result = pd.merge(employees, departments, left_on="dept_id", right_on="id", how="outer")
# JOIN on same-name column
result = pd.merge(orders, customers, on="customer_id")
join — index-based shortcut
# When the join key is the index
employees.set_index("dept_id").join(departments.set_index("id"), how="left")
concat — stacking DataFrames
# Stack vertically (like UNION ALL)
all_data = pd.concat([q1_sales, q2_sales, q3_sales, q4_sales])
# Stack horizontally (side by side)
combined = pd.concat([df1, df2], axis=1)
Tip: pass indicator=True to pd.merge() to get a _merge column showing "left_only", "right_only", or "both", which is great for debugging.
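A minimal sketch of the indicator=True trick (both tables are made up):

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 3]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bo"]})

# Left join keeps all orders; _merge flags which rows found a match
result = pd.merge(orders, customers, on="customer_id", how="left", indicator=True)

print(result["_merge"].astype(str).tolist())  # ['both', 'both', 'left_only']
```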
Handling Missing Data
Real-world data is messy. Pandas represents missing values as NaN (Not a Number). Here is your toolkit for dealing with them.
Detecting
# Check for nulls
df.isna() # Boolean DataFrame
df.isna().sum() # Count nulls per column
df.isna().sum().sum() # Total nulls in entire DataFrame
# Percentage of nulls
(df.isna().sum() / len(df) * 100).round(1)
Filling
# Fill with a constant
df["salary"].fillna(0)
# Fill with the column mean
df["salary"].fillna(df["salary"].mean())
# Forward fill (propagate last valid value)
df["price"].ffill()  # fillna(method="ffill") is deprecated; use .ffill() / .bfill()
# Interpolate (linear by default — great for time series)
df["temperature"].interpolate()
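The fill methods above in a runnable sketch (the series is made up); .ffill() is the modern spelling of the older fillna(method="ffill"):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled = s.ffill()        # carry the last valid value forward
interp = s.interpolate()  # linear interpolation between known points

print(filled.tolist())  # [1.0, 1.0, 1.0, 4.0]
print(interp.tolist())  # [1.0, 2.0, 3.0, 4.0]
```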
Dropping
# Drop rows with ANY null
df.dropna()
# Drop rows where a specific column is null
df.dropna(subset=["email"])
# Drop columns that are mostly null (> 50% missing)
threshold = int(len(df) * 0.5)  # thresh = minimum non-null values a column needs to survive
df.dropna(axis=1, thresh=threshold)
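A sketch showing how thresh works on a made-up frame (it counts the non-null values a column must have to be kept):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "good": [1, 2, 3, 4],
    "mostly_null": [1, np.nan, np.nan, np.nan],  # 75% missing
})

# Keep only columns with at least 50% non-null values
kept = df.dropna(axis=1, thresh=int(len(df) * 0.5))

print(list(kept.columns))  # ['good']
```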
Reshaping Data
Sometimes your data is in the wrong shape for your analysis. Pivot makes it wider; melt makes it longer.
pivot_table — long to wide
# Rows: department, Columns: year, Values: average salary
pivot = df.pivot_table(
    values="salary",
    index="department",
    columns="year",
    aggfunc="mean",
)
# Result: departments as rows, years as columns, cells = avg salary
melt — wide to long
# Turn columns into rows (opposite of pivot)
# Useful when each column is a year or category
long_df = pd.melt(
    wide_df,
    id_vars=["department"],       # keep these as-is
    value_vars=["2023", "2024"],  # unpivot these columns
    var_name="year",
    value_name="revenue",
)
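The melt call above, runnable end to end with a made-up wide table:

```python
import pandas as pd

wide_df = pd.DataFrame({
    "department": ["Eng", "Sales"],
    "2023": [500, 300],
    "2024": [550, 320],
})

# One row per (department, year) pair instead of one column per year
long_df = pd.melt(
    wide_df,
    id_vars=["department"],
    value_vars=["2023", "2024"],
    var_name="year",
    value_name="revenue",
)

print(len(long_df))  # 4 rows: 2 departments x 2 years
```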
stack / unstack
# stack: columns → rows (adds a level to the index)
stacked = df.stack()
# unstack: rows → columns (removes a level from the index)
unstacked = df.unstack()
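A small sketch of a stack/unstack round trip (the data is made up):

```python
import pandas as pd

df = pd.DataFrame(
    {"2023": [500, 300], "2024": [550, 320]},
    index=["Eng", "Sales"],
)

stacked = df.stack()            # MultiIndex Series: (dept, year) -> value
round_trip = stacked.unstack()  # back to the original wide shape

print(stacked.loc[("Eng", "2024")])  # 550
```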
Use pivot_table for summary reports (like Excel pivot tables). Use melt to prepare data for visualization libraries that expect "tidy" long-format data.
Method Chaining
Instead of saving intermediate variables, chain operations together. This reads top-to-bottom, like a recipe.
result = (
    df
    .query("department == 'Engineering'")
    .assign(
        salary_k=lambda x: x["salary"] / 1000,
        tenure=lambda x: 2026 - x["hire_year"],
    )
    .groupby("level")
    .agg(
        avg_salary_k=("salary_k", "mean"),
        headcount=("id", "count"),
    )
    .sort_values("avg_salary_k", ascending=False)
    .reset_index()
)
To debug a chain, end it early with .head() and inspect the intermediate result. You can also insert .pipe(lambda x: print(x.shape) or x) to log shapes mid-chain.
SQL-to-Pandas Cheat Sheet
If you know SQL, you know most of Pandas. This table maps the 15+ most common SQL operations to their Pandas equivalents.
| SQL | Pandas | Notes |
|---|---|---|
| SELECT * | df | The whole DataFrame |
| SELECT col1, col2 | df[["col1", "col2"]] | Double brackets return a DataFrame |
| SELECT DISTINCT col | df["col"].unique() | Returns a NumPy array |
| WHERE col = val | df[df["col"] == val] | Boolean indexing |
| WHERE col IN (a, b) | df[df["col"].isin([a, b])] | Pass a list |
| WHERE col LIKE '%text%' | df[df["col"].str.contains("text")] | Regex supported by default |
| WHERE col IS NULL | df[df["col"].isna()] | Use notna() for IS NOT NULL |
| ORDER BY col DESC | df.sort_values("col", ascending=False) | Ascending by default |
| LIMIT 10 | df.head(10) | Or df.iloc[:10] |
| GROUP BY col | df.groupby("col") | Chain with .agg() |
| HAVING COUNT(*) > 5 | .filter(lambda x: len(x) > 5) | After groupby |
| COUNT(*) | df.groupby("col").size() | Or len(df) for total |
| SUM(col) | df["col"].sum() | Also: mean(), min(), max() |
| INNER JOIN | pd.merge(a, b, on="key") | Default is inner |
| LEFT JOIN | pd.merge(a, b, on="key", how="left") | Also: "right", "outer" |
| UNION ALL | pd.concat([a, b]) | Use drop_duplicates() for UNION |
| CASE WHEN | np.where(cond, val1, val2) | Or df["col"].map() for multiple cases |
Performance Tips
What to do
# GOOD: vectorized (fast)
df["bonus"] = df["salary"] * 0.1
# BAD: iterrows (slow — loops in Python)
for idx, row in df.iterrows():
    df.at[idx, "bonus"] = row["salary"] * 0.1
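The two approaches give identical results; a runnable sketch confirming it on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"salary": [100.0, 200.0, 300.0]})

# Vectorized: one C-level multiply over the whole column
df["bonus_fast"] = df["salary"] * 0.1

# Row-by-row loop: same answer, dramatically slower on big frames
df["bonus_slow"] = 0.0
for idx, row in df.iterrows():
    df.at[idx, "bonus_slow"] = row["salary"] * 0.1

assert df["bonus_fast"].equals(df["bonus_slow"])
```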
More performance tips
- Read only what you need: use usecols and nrows in read_csv().
- Downcast types: df["id"] = df["id"].astype("int32") cuts memory in half vs int64.
- Use categoricals: df["status"] = df["status"].astype("category") for columns with few unique values.
- Filter before groupby: reduce rows first, then aggregate, just like SQL.
- Avoid apply when possible: use built-in methods (str, dt accessors) instead of .apply(lambda).
df.info(memory_usage="deep") shows actual RAM usage. If your DataFrame is bigger than available RAM, look into dask, polars, or process in chunks with read_csv(chunksize=10000).
Test Yourself
Q: What is the difference between df.loc[] and df.iloc[]?
loc selects by label (column names, index labels) and is inclusive on both ends. iloc selects by integer position and is exclusive on the end, like standard Python slicing.
Q: How would you load only the first 500 rows and columns "id", "name", "score" from a CSV file?
pd.read_csv("file.csv", usecols=["id", "name", "score"], nrows=500). Loading only what you need is faster and uses less memory.
Q: What does df.groupby("dept")["salary"].transform("mean") do that .agg("mean") does not?
transform returns a Series the same length as the original DataFrame (one mean value per row, matching its group). agg returns a collapsed result with one row per group. Use transform when you want to add a group-level value back as a column.
Q: You have two DataFrames, orders and customers, both with a "customer_id" column. How do you get all orders with customer details, including orders whose customer is missing from the customers table?
pd.merge(orders, customers, on="customer_id", how="left"). A left merge keeps all rows from the left DataFrame (orders) and fills NaN where no matching customer exists.
Q: Why should you avoid iterrows() for calculations on large DataFrames?
iterrows() loops row-by-row in Python, which is extremely slow. Vectorized Pandas/NumPy operations run in optimized C code and can be 100x+ faster. Instead of looping, use column arithmetic (df["a"] + df["b"]), np.where(), or .apply() as a last resort.
Interview Questions
Q: Explain the difference between a Pandas Series and a DataFrame.
A Series is a single labeled column (1D); a DataFrame is a 2D table whose columns are Series sharing a common index. You get a Series when you select one column: df["name"]. You get a DataFrame when you select multiple: df[["name", "age"]].
Q: You are given a DataFrame with 10 million rows and your Pandas code is slow. Walk through how you would optimize it.
(1) Profile to find the bottleneck. (2) Replace iterrows()/apply() with vectorized ops. (3) Downcast dtypes (int64→int32, object→category). (4) Filter early to reduce row count. (5) Use usecols to load only needed columns. (6) Consider chunked processing (read_csv(chunksize=N)). (7) If still too slow, switch to Polars or Dask for parallel processing.
Q: What is the Pandas equivalent of this SQL query: SELECT department, AVG(salary) FROM employees WHERE hire_year >= 2020 GROUP BY department HAVING AVG(salary) > 80000?
(
    df[df["hire_year"] >= 2020]
    .groupby("department")["salary"]
    .mean()
    .loc[lambda x: x > 80000]
)
Filter first (WHERE), then group (GROUP BY), then filter groups (HAVING). The .loc[lambda] pattern replaces HAVING.
Q: How does pd.merge() handle duplicate keys? What problems can this cause?
Duplicate keys on either side produce a many-to-many (Cartesian) match: every pair of matching rows becomes an output row, so the result can silently grow much larger than either input. Check len(result) after a merge and use validate="one_to_many" or validate="one_to_one" to catch unexpected duplicates.