Python for Analytics Interview Questions
30+ Python for Data Analytics interview questions with answers, organized by topic. Perfect for a quick 15-minute revision before an interview.
Pandas Basics
Q: What is the difference between a Pandas Series and a DataFrame?
A Series is a one-dimensional labeled array (a single column); a DataFrame is a two-dimensional table whose columns are each a Series sharing a common index. df["col"] returns a Series; df[["col1","col2"]] returns a DataFrame.
Q: Name 3 ways to read data into a Pandas DataFrame.
pd.read_csv() for CSV files, pd.read_excel() for Excel files, pd.read_sql() for databases. Others: read_json(), read_parquet(), read_html() (scrapes tables from web pages), read_clipboard() (paste from Excel).
Q: What is the difference between df.loc[] and df.iloc[]?
loc selects by label (column names, index labels) and the end of a slice is inclusive. iloc selects by integer position and the end is exclusive (standard Python slicing). df.loc[0:3] returns 4 rows; df.iloc[0:3] returns 3 rows.
Q: How do you check the data types of each column? How do you change a column's type?
df.dtypes or df.info(). Change: df["col"] = df["col"].astype("int32"). For dates: df["date"] = pd.to_datetime(df["date"]). For categorical: df["status"] = df["status"].astype("category"). Wrong dtypes are the #1 silent bug in data analysis.
Q: What is the index in a DataFrame? When would you set a custom index?
The index is the set of row labels (by default the integers 0..n-1). Use df.set_index("col") when you want fast lookups by that column, or when using .loc[] for label-based access. Common examples: dates as the index for time series, or user_id for quick user lookups. Reset with df.reset_index().
Q: How do you filter rows where a column matches one of several values?
isin(): df[df["department"].isin(["Engineering", "Product", "Design"])]. This is the Pandas equivalent of SQL's WHERE department IN ('Engineering', 'Product', 'Design'). For the opposite (NOT IN), use ~df["department"].isin([...]).
Deeper coverage: Pandas Deep Dive
Data Manipulation
Q: Explain how groupby() works. What are the 3 steps?
Split, apply, combine: (1) split the rows into groups by key, (2) apply an aggregation or transform to each group, (3) combine the per-group results into a new Series or DataFrame. Example: df.groupby("dept")["salary"].mean() splits by department, calculates the mean salary for each, and returns a Series with one value per department.
Q: What is the difference between pd.merge() with how="inner" vs how="left"?
how="inner" keeps only rows whose key appears in both DataFrames; how="left" keeps every row of the left DataFrame and fills NaN where the right side has no match. Inner joins can silently drop rows, so comparing row counts before and after the merge is a useful sanity check.
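A minimal sketch of the inner-vs-left difference, using made-up emp and dept frames (all names here are illustrative):

```python
import pandas as pd

# Toy tables: employee "Cy" has a dept_id with no match in dept
emp = pd.DataFrame({"name": ["Ana", "Bo", "Cy"], "dept_id": [1, 2, 9]})
dept = pd.DataFrame({"dept_id": [1, 2], "dept_name": ["Eng", "Sales"]})

inner = pd.merge(emp, dept, on="dept_id", how="inner")  # unmatched row dropped
left = pd.merge(emp, dept, on="dept_id", how="left")    # unmatched row kept, NaN filled

print(len(inner), len(left))  # 2 3
```

Comparing len() before and after a merge is a cheap check that an inner join has not silently dropped rows.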
Q: How does pivot_table() work? Give a real example.
df.pivot_table(
    values="revenue",   # what to calculate
    index="region",     # rows
    columns="quarter",  # columns
    aggfunc="sum"       # aggregation function
)
Result: regions as rows, quarters as columns, cells = total revenue. Like an Excel pivot table.
Q: A DataFrame has missing values. Walk through your decision process for handling them.
(1) Quantify: df.isna().sum() — how many nulls per column? (2) Understand why: missing at random, or systematically? (3) Decide: drop if few nulls and random (dropna()); fill with mean/median if numeric and random (fillna()); forward-fill for time series (ffill); use a flag column if missingness is informative (df["col_missing"] = df["col"].isna()). (4) Never blindly fill — filling salary with the mean when high earners declined to report biases the analysis.
Q: What is method chaining in Pandas? Why is it useful?
result = (
    df
    .query("year >= 2024")
    .groupby("dept")["salary"]
    .mean()
    .sort_values(ascending=False)
)
Benefits: reads top-to-bottom like a recipe, no clutter of temp variables, easy to add/remove steps. Use .assign() to create columns and .pipe() for custom functions mid-chain.
Q: What is the difference between concat() and merge()?
concat() stacks DataFrames vertically (like SQL UNION) or horizontally by position. merge() combines DataFrames based on a shared key column (like SQL JOIN). Use concat when you have the same columns and want to stack rows. Use merge when you want to enrich one DataFrame with columns from another based on a matching key.
Deeper coverage: Pandas Deep Dive
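A quick sketch of the distinction on toy frames (df_q1, df_q2, and targets are invented for illustration):

```python
import pandas as pd

df_q1 = pd.DataFrame({"name": ["Ana", "Bo"], "sales": [10, 20]})
df_q2 = pd.DataFrame({"name": ["Cy"], "sales": [30]})
targets = pd.DataFrame({"name": ["Ana", "Bo", "Cy"], "target": [12, 18, 25]})

stacked = pd.concat([df_q1, df_q2], ignore_index=True)  # UNION: 2 + 1 = 3 rows
enriched = stacked.merge(targets, on="name")            # JOIN: adds target column

print(stacked.shape, enriched.shape)  # (3, 2) (3, 3)
```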
NumPy
Q: What is the difference between a Python list and a NumPy array?
A list is a general-purpose container of arbitrary Python objects; a NumPy array is a fixed-dtype block of contiguous memory that supports fast element-wise math. [1,2,3] * 2 gives [1,2,3,1,2,3] (list repetition); np.array([1,2,3]) * 2 gives [2,4,6] (element-wise).
Q: What is broadcasting in NumPy? Give an example.
np.array([1,2,3]) + 10 → [11,12,13]. The scalar 10 is "broadcast" across all elements. Works for matrices too: a (3x3) matrix + a (1x3) row adds that row to every row of the matrix. No loops needed.
Q: Why is vectorization important? How does it relate to Pandas performance?
df["salary"] * 1.1 is vectorized — one call to compiled C code handles all rows. for idx, row in df.iterrows(): row["salary"] * 1.1 loops in Python — often 100x slower. Pandas is built on NumPy, so every vectorized Pandas op is a vectorized NumPy op underneath.
Q: How do you generate reproducible random numbers with NumPy?
np.random.seed(42) (legacy) or use the newer Generator API:
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0, scale=1, size=1000)
Same seed = same sequence every time. Essential for reproducible analysis and testing.
Deeper coverage: Core Concepts
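The reproducibility claim is easy to verify — two generators built from the same seed produce identical streams, while a different seed diverges:

```python
import numpy as np

a = np.random.default_rng(seed=42).normal(size=5)
b = np.random.default_rng(seed=42).normal(size=5)  # same seed
c = np.random.default_rng(seed=7).normal(size=5)   # different seed

print(np.array_equal(a, b))  # True
print(np.array_equal(a, c))  # False
```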
Visualization
Q: When should you use a bar chart vs a line chart vs a scatter plot?
Bar chart: compare a quantity across discrete categories. Line chart: show a trend over time or another ordered, continuous axis. Scatter plot: show the relationship between two numeric variables (and spot correlation or outliers).
Q: What is the difference between Matplotlib and Seaborn?
Matplotlib is the low-level plotting foundation: full control over every figure element, but verbose. Seaborn is built on top of Matplotlib and adds high-level statistical plots (heatmaps, distribution plots, regression plots) with better defaults and direct DataFrame support. Typical pattern: draw with Seaborn, fine-tune with Matplotlib.
Q: How do you make a correlation heatmap in Seaborn?
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)  # correlation matrix (numeric columns only)
sns.heatmap(corr, annot=True, fmt=".2f",
            cmap="coolwarm", center=0)
plt.title("Correlation Matrix")
plt.show()
annot=True prints the values; center=0 makes the colormap symmetric around zero. Look for values near +1 or -1 for strong relationships.
Q: How would you customize a Matplotlib chart for a professional presentation?
(1) Use the object-oriented API with an explicit size: fig, ax = plt.subplots(figsize=(10, 6)). (2) Add labels: ax.set_title(), ax.set_xlabel(), ax.set_ylabel(). (3) Style: remove top/right spines, add a subtle grid, use brand colors. (4) Legend: ax.legend(). (5) Layout: plt.tight_layout(). (6) Save at high DPI: fig.savefig("chart.png", dpi=150, bbox_inches="tight").
Deeper coverage: Data Visualization Deep Dive
SQL-to-Pandas Translation
Q: Translate to Pandas: SELECT name, salary FROM employees WHERE department = 'Engineering' ORDER BY salary DESC LIMIT 10
(
    df[df["department"] == "Engineering"][["name", "salary"]]
    .sort_values("salary", ascending=False)
    .head(10)
)
Or with query: df.query("department == 'Engineering'")[["name","salary"]].nlargest(10, "salary")
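Both forms give the same result, which can be checked on a toy employees frame (the data is made up; top 2 instead of 10 to keep the output short):

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Ana", "Bo", "Cy", "Di"],
    "department": ["Engineering", "Engineering", "Sales", "Engineering"],
    "salary": [120, 150, 90, 110],
})

# Bracket style: filter, project, sort, limit
bracket = (
    employees[employees["department"] == "Engineering"][["name", "salary"]]
    .sort_values("salary", ascending=False)
    .head(2)
)
# query/nlargest style
shorter = employees.query("department == 'Engineering'")[["name", "salary"]].nlargest(2, "salary")

print(bracket["name"].tolist())  # ['Bo', 'Ana']
```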
Q: Translate to Pandas: SELECT department, COUNT(*), AVG(salary) FROM employees GROUP BY department HAVING COUNT(*) > 5
result = (
    df.groupby("department")
    .agg(count=("id", "count"), avg_salary=("salary", "mean"))
    .query("count > 5")
)
The .query("count > 5") is the HAVING equivalent — it filters after grouping.
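A toy run showing the HAVING behavior (three-person Eng, one-person Sales, all values invented):

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng", "Eng", "Eng", "Sales"],
    "id": [1, 2, 3, 4],
    "salary": [100, 120, 140, 90],
})

result = (
    df.groupby("department")
    .agg(count=("id", "count"), avg_salary=("salary", "mean"))
    .query("count > 2")  # HAVING COUNT(*) > 2: only Eng survives
)

print(result.index.tolist())  # ['Eng']
```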
Q: Translate to Pandas: SELECT e.name, d.dept_name FROM employees e LEFT JOIN departments d ON e.dept_id = d.id
result = pd.merge(
    employees, departments,
    left_on="dept_id", right_on="id",
    how="left"
)[["name", "dept_name"]]
Key: how="left" keeps all employees, even those without a matching department (NaN in dept_name).
Q: Translate to Pandas: SELECT *, RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank FROM employees
df["rank"] = (
    df.groupby("department")["salary"]
    .rank(method="min", ascending=False)
    .astype(int)
)
Pandas rank(method="min") mirrors SQL RANK(). Use method="dense" for DENSE_RANK() and method="first" for ROW_NUMBER().
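The three methods only disagree on ties, so a toy department with a tied salary makes the mapping concrete:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng"] * 4,
    "salary": [150, 140, 140, 120],  # one tie at 140
})
by_salary = df.groupby("department")["salary"]

df["rank"] = by_salary.rank(method="min", ascending=False).astype(int)       # RANK()
df["dense"] = by_salary.rank(method="dense", ascending=False).astype(int)    # DENSE_RANK()
df["row_num"] = by_salary.rank(method="first", ascending=False).astype(int)  # ROW_NUMBER()

print(df["rank"].tolist())     # [1, 2, 2, 4]  gap after the tie
print(df["dense"].tolist())    # [1, 2, 2, 3]  no gap
print(df["row_num"].tolist())  # [1, 2, 3, 4]  ties broken by order
```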
Full comparison table: SQL-to-Pandas Cheat Sheet
Scenario Questions
Q: You receive a messy CSV with mixed date formats, duplicate rows, inconsistent casing in category columns, and 15% missing values in the "revenue" column. Walk through how you would clean it.
# 1. Load and inspect
df = pd.read_csv("messy.csv")
print(df.shape, df.dtypes, df.isna().sum())
# 2. Fix dates
df["date"] = pd.to_datetime(df["date"], format="mixed")
# 3. Remove duplicates
df = df.drop_duplicates()
# 4. Standardize categories
df["category"] = df["category"].str.strip().str.lower()
# 5. Handle missing revenue
# Check if random or systematic first
df["revenue"] = df["revenue"].fillna(df.groupby("category")["revenue"].transform("median"))
# 6. Verify
print(df.isna().sum()) # should be 0
print(df.dtypes) # dates should be datetime64
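The groupwise fill in step 5 is the subtle part: transform("median") replaces each null with the median of its own category, not the global median. A toy check:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b"],
    "revenue": [10.0, 30.0, np.nan, 100.0, np.nan],
})

# Each null gets its own group's median: a -> 20.0, b -> 100.0
df["revenue"] = df["revenue"].fillna(
    df.groupby("category")["revenue"].transform("median")
)

print(df["revenue"].tolist())  # [10.0, 30.0, 20.0, 100.0, 100.0]
```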
Q: Your manager asks for a weekly sales report. You have a DataFrame with daily sales. How would you build this?
# Ensure date column is datetime and set as index
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
# Resample to weekly frequency
weekly = df.resample("W").agg(
    total_sales=("revenue", "sum"),
    avg_order_value=("revenue", "mean"),
    num_orders=("order_id", "count")
)
# Add week-over-week change
weekly["wow_change"] = weekly["total_sales"].pct_change() * 100
# Visualize
weekly["total_sales"].plot(kind="bar", figsize=(12, 5),
                           title="Weekly Sales")
plt.tight_layout()
plt.show()
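The resample step can be exercised on synthetic data: 14 days of constant revenue starting on a Monday, so each weekly bucket should sum to exactly 70.

```python
import pandas as pd

# 2024-01-01 is a Monday; two full Mon-Sun weeks, one 10.0-revenue order per day
dates = pd.date_range("2024-01-01", periods=14, freq="D")
df = pd.DataFrame({"revenue": [10.0] * 14, "order_id": range(14)}, index=dates)

weekly = df.resample("W").agg(
    total_sales=("revenue", "sum"),
    num_orders=("order_id", "count"),
)

print(weekly["total_sales"].tolist())  # [70.0, 70.0]
```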
Q: You notice a sudden spike in website errors last Tuesday. You have a DataFrame with columns: timestamp, error_type, endpoint, user_id. How would you investigate?
# 1. Isolate the spike day
tuesday = df[df["timestamp"].dt.date == pd.Timestamp("2026-03-24").date()]
# 2. What types of errors?
tuesday["error_type"].value_counts()
# 3. Which endpoints are affected?
tuesday.groupby(["endpoint", "error_type"]).size().sort_values(ascending=False).head(10)
# 4. Timeline — when exactly did it start?
tuesday.set_index("timestamp").resample("1h").size().plot(title="Errors per Hour")
# 5. Is it one user or many?
tuesday["user_id"].nunique() # vs total errors
# 6. Compare to a normal day
normal = df[df["timestamp"].dt.date == pd.Timestamp("2026-03-17").date()]
print(f"Tuesday: {len(tuesday)} errors, Normal: {len(normal)} errors")
Approach: quantify the spike, break down by dimension (type, endpoint, time, user), compare to baseline.
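Steps 2 and 4 in miniature, on five invented error rows:

```python
import pandas as pd

errors = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2026-03-24 09:05", "2026-03-24 09:40",
        "2026-03-24 10:10", "2026-03-24 10:15", "2026-03-24 10:50",
    ]),
    "error_type": ["500", "500", "timeout", "500", "timeout"],
})

top_types = errors["error_type"].value_counts()                 # step 2: what kinds?
per_hour = errors.set_index("timestamp").resample("1h").size()  # step 4: when?

print(top_types.index[0], per_hour.tolist())  # 500 [2, 3]
```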
Q: Your Pandas code takes 10 minutes to run on a 5 million row DataFrame. How would you optimize it?
(1) Profile first: %%timeit or line_profiler to find the slow line.
(2) Replace loops: iterrows() and apply() are the usual culprits. Rewrite with vectorized ops.
(3) Downcast types: int64 → int32, float64 → float32, object → category. Check with df.info(memory_usage="deep").
(4) Filter early: remove unneeded rows/columns before heavy operations.
(5) Use efficient methods: nlargest() instead of sort_values().head().
(6) Read selectively: usecols, nrows, or switch to Parquet (columnar, compressed).
(7) Last resort: switch to Polars (Rust-based, 5-10x faster) or Dask (parallelized Pandas).
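Point (3) is easy to demonstrate — a low-cardinality string column shrinks by an order of magnitude as category:

```python
import pandas as pd
import numpy as np

# 100k rows but only 3 distinct strings: ideal for the category dtype
rng = np.random.default_rng(0)
col = pd.Series(rng.choice(["bronze", "silver", "gold"], size=100_000))

as_object = col.memory_usage(deep=True)
as_category = col.astype("category").memory_usage(deep=True)

print(f"object: {as_object:,} B, category: {as_category:,} B")
```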
Q: You have two DataFrames: last month's product catalog and this month's product catalog. How would you find products that were added, removed, and price-changed?
# Merge with indicator to see which rows match
comparison = pd.merge(
    last_month, this_month,
    on="product_id", how="outer",
    suffixes=("_old", "_new"),
    indicator=True
)
# New products (only in this month)
added = comparison[comparison["_merge"] == "right_only"]
# Removed products (only in last month)
removed = comparison[comparison["_merge"] == "left_only"]
# Price changes (in both, but different price)
both = comparison[comparison["_merge"] == "both"]
price_changed = both[both["price_old"] != both["price_new"]].copy()  # .copy() avoids SettingWithCopyWarning on the next line
price_changed["price_diff"] = price_changed["price_new"] - price_changed["price_old"]
print(f"Added: {len(added)}, Removed: {len(removed)}, Price changed: {len(price_changed)}")
The indicator=True trick is the key — it tells you where each row came from.
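A toy run of the pattern with two three-product catalogs (ids and prices invented):

```python
import pandas as pd

last_month = pd.DataFrame({"product_id": [1, 2, 3], "price": [10, 20, 30]})
this_month = pd.DataFrame({"product_id": [2, 3, 4], "price": [20, 35, 40]})

comparison = last_month.merge(
    this_month, on="product_id", how="outer",
    suffixes=("_old", "_new"), indicator=True,
)
added = comparison[comparison["_merge"] == "right_only"]      # product 4
removed = comparison[comparison["_merge"] == "left_only"]     # product 1
both = comparison[comparison["_merge"] == "both"]
price_changed = both[both["price_old"] != both["price_new"]]  # product 3: 30 -> 35

print(len(added), len(removed), len(price_changed))  # 1 1 1
```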
Q: You are asked to analyze user retention. You have a DataFrame with user_id and login_date. How would you calculate the percentage of users who returned in their second month?
# 1. Find each user's first login month
df["login_date"] = pd.to_datetime(df["login_date"])  # ensure datetime dtype first
df["login_month"] = df["login_date"].dt.to_period("M")
first_month = df.groupby("user_id")["login_month"].min().rename("cohort")
# 2. Merge cohort back
df = df.merge(first_month.reset_index(), on="user_id")  # reset_index makes user_id a column so "on" works
# 3. Calculate months since first login
df["months_since"] = (df["login_month"] - df["cohort"]).apply(lambda x: x.n)
# 4. Count unique users per cohort per month
cohort_data = df.groupby(["cohort", "months_since"])["user_id"].nunique().unstack()
# 5. Calculate retention rate
retention = cohort_data.div(cohort_data[0], axis=0) * 100
print(f"Month-1 retention: {retention[1].mean():.1f}%")
This is a cohort analysis — group users by their signup month, then track what percentage are still active in subsequent months.
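The whole recipe on six invented login rows: three January users, two of whom log in again in February, so month-1 retention should be 66.7%.

```python
import pandas as pd

logins = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "login_date": pd.to_datetime([
        "2024-01-05", "2024-02-10",                # user 1 returns in month 1
        "2024-01-20",                              # user 2 never returns
        "2024-01-03", "2024-01-25", "2024-02-02",  # user 3 returns
    ]),
})

logins["login_month"] = logins["login_date"].dt.to_period("M")
first = logins.groupby("user_id")["login_month"].min().rename("cohort").reset_index()
logins = logins.merge(first, on="user_id")
logins["months_since"] = (logins["login_month"] - logins["cohort"]).apply(lambda x: x.n)

cohort = logins.groupby(["cohort", "months_since"])["user_id"].nunique().unstack()
retention = cohort.div(cohort[0], axis=0) * 100  # month 0 is 100% by construction

print(round(retention[1].iloc[0], 1))  # 66.7
```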