Python for Analytics Interview Questions
30+ Python for Data Analytics interview questions with answers, organized by topic. Perfect for a quick 15-minute revision before an interview.
Pandas Basics
Q: What is the difference between a Pandas Series and a DataFrame?
A Series is a one-dimensional labeled array (a single column); a DataFrame is a two-dimensional table whose columns are each a Series sharing a common index. df["col"] returns a Series; df[["col1","col2"]] returns a DataFrame.
Q: Name 3 ways to read data into a Pandas DataFrame.
pd.read_csv() for CSV files, pd.read_excel() for Excel files, pd.read_sql() for databases. Others: read_json(), read_parquet(), read_html() (scrapes tables from web pages), read_clipboard() (paste from Excel).
Q: What is the difference between df.loc[] and df.iloc[]?
loc selects by label (column names, index labels) and the end of a slice is inclusive. iloc selects by integer position and the end is exclusive (standard Python slicing). df.loc[0:3] returns 4 rows; df.iloc[0:3] returns 3 rows.
Q: How do you check the data types of each column? How do you change a column's type?
df.dtypes or df.info(). Change: df["col"] = df["col"].astype("int32"). For dates: df["date"] = pd.to_datetime(df["date"]). For categorical: df["status"] = df["status"].astype("category"). Wrong dtypes are the #1 silent bug in data analysis.
Q: What is the index in a DataFrame? When would you set a custom index?
The index is the set of row labels (by default the integers 0..n-1). Use df.set_index("col") when you want fast lookups by that column, or when using .loc[] for label-based access. Common examples: dates as the index for time series, or user_id for quick user lookups. Reset with df.reset_index().
Q: How do you filter rows where a column matches one of several values?
isin(): df[df["department"].isin(["Engineering", "Product", "Design"])]. This is the Pandas equivalent of SQL's WHERE department IN ('Engineering', 'Product', 'Design'). For the opposite (NOT IN), use ~df["department"].isin([...]).
Deeper coverage: Pandas Deep Dive
Data Manipulation
Q: Explain how groupby() works. What are the 3 steps?
Split, apply, combine: (1) split the rows into groups by key, (2) apply an aggregation or transform to each group, (3) combine the per-group results into a new Series or DataFrame. Example: df.groupby("dept")["salary"].mean() splits by department, calculates the mean salary for each, and returns a Series with one value per department.
Q: What is the difference between pd.merge() with how="inner" vs how="left"?
how="inner" keeps only rows whose key appears in both DataFrames; how="left" keeps every row of the left DataFrame and fills NaN where the right side has no match. Inner joins can silently drop rows, so comparing row counts before and after the merge is a useful sanity check.
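A minimal sketch of the inner-vs-left difference, using made-up emp and dept frames (all names here are illustrative):

```python
import pandas as pd

# Toy tables: employee "Cy" has a dept_id with no match in dept
emp = pd.DataFrame({"name": ["Ana", "Bo", "Cy"], "dept_id": [1, 2, 9]})
dept = pd.DataFrame({"dept_id": [1, 2], "dept_name": ["Eng", "Sales"]})

inner = pd.merge(emp, dept, on="dept_id", how="inner")  # unmatched row dropped
left = pd.merge(emp, dept, on="dept_id", how="left")    # unmatched row kept, NaN filled

print(len(inner), len(left))  # 2 3
```

Comparing len() before and after a merge is a cheap check that an inner join has not silently dropped rows.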
Q: How does pivot_table() work? Give a real example.
df.pivot_table(
    values="revenue",   # what to calculate
    index="region",     # rows
    columns="quarter",  # columns
    aggfunc="sum"       # aggregation function
)
Result: regions as rows, quarters as columns, cells = total revenue. Like an Excel pivot table.
Q: A DataFrame has missing values. Walk through your decision process for handling them.
(1) Quantify: df.isna().sum() — how many nulls per column? (2) Understand why: missing at random, or systematically? (3) Decide: drop if few nulls and random (dropna()); fill with mean/median if numeric and random (fillna()); forward-fill for time series (ffill); use a flag column if missingness is informative (df["col_missing"] = df["col"].isna()). (4) Never blindly fill — filling salary with the mean when high earners declined to report biases the analysis.
Q: What is method chaining in Pandas? Why is it useful?
result = (
    df
    .query("year >= 2024")
    .groupby("dept")["salary"]
    .mean()
    .sort_values(ascending=False)
)
Benefits: reads top-to-bottom like a recipe, no clutter of temp variables, easy to add/remove steps. Use .assign() to create columns and .pipe() for custom functions mid-chain.
Q: What is the difference between concat() and merge()?
concat() stacks DataFrames vertically (like SQL UNION) or horizontally by position. merge() combines DataFrames based on a shared key column (like SQL JOIN). Use concat when you have the same columns and want to stack rows. Use merge when you want to enrich one DataFrame with columns from another based on a matching key.
Deeper coverage: Pandas Deep Dive
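A quick sketch of the distinction on toy frames (df_q1, df_q2, and targets are invented for illustration):

```python
import pandas as pd

df_q1 = pd.DataFrame({"name": ["Ana", "Bo"], "sales": [10, 20]})
df_q2 = pd.DataFrame({"name": ["Cy"], "sales": [30]})
targets = pd.DataFrame({"name": ["Ana", "Bo", "Cy"], "target": [12, 18, 25]})

stacked = pd.concat([df_q1, df_q2], ignore_index=True)  # UNION: 2 + 1 = 3 rows
enriched = stacked.merge(targets, on="name")            # JOIN: adds target column

print(stacked.shape, enriched.shape)  # (3, 2) (3, 3)
```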
NumPy
Q: What is the difference between a Python list and a NumPy array?
A list is a general-purpose container of arbitrary Python objects; a NumPy array is a fixed-dtype block of contiguous memory that supports fast element-wise math. [1,2,3] * 2 gives [1,2,3,1,2,3] (list repetition); np.array([1,2,3]) * 2 gives [2,4,6] (element-wise).
Q: What is broadcasting in NumPy? Give an example.
np.array([1,2,3]) + 10 → [11,12,13]. The scalar 10 is "broadcast" across all elements. Works for matrices too: a (3x3) matrix + a (1x3) row adds that row to every row of the matrix. No loops needed.
Q: Why is vectorization important? How does it relate to Pandas performance?
df["salary"] * 1.1 is vectorized — one call to compiled C code handles all rows. for idx, row in df.iterrows(): row["salary"] * 1.1 loops in Python — often 100x slower. Pandas is built on NumPy, so every vectorized Pandas op is a vectorized NumPy op underneath.
Q: How do you generate reproducible random numbers with NumPy?
np.random.seed(42) (legacy) or use the newer Generator API:
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0, scale=1, size=1000)
Same seed = same sequence every time. Essential for reproducible analysis and testing.
Deeper coverage: Core Concepts
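The reproducibility claim is easy to verify — two generators built from the same seed produce identical streams, while a different seed diverges:

```python
import numpy as np

a = np.random.default_rng(seed=42).normal(size=5)
b = np.random.default_rng(seed=42).normal(size=5)  # same seed
c = np.random.default_rng(seed=7).normal(size=5)   # different seed

print(np.array_equal(a, b))  # True
print(np.array_equal(a, c))  # False
```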
Visualization
Q: When should you use a bar chart vs a line chart vs a scatter plot?
Bar chart: compare a quantity across discrete categories. Line chart: show a trend over time or another ordered, continuous axis. Scatter plot: show the relationship between two numeric variables (and spot correlation or outliers).
Q: What is the difference between Matplotlib and Seaborn?
Matplotlib is the low-level plotting foundation: full control over every figure element, but verbose. Seaborn is built on top of Matplotlib and adds high-level statistical plots (heatmaps, distribution plots, regression plots) with better defaults and direct DataFrame support. Typical pattern: draw with Seaborn, fine-tune with Matplotlib.
Q: How do you make a correlation heatmap in Seaborn?
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)  # correlation matrix (numeric columns only)
sns.heatmap(corr, annot=True, fmt=".2f",
            cmap="coolwarm", center=0)
plt.title("Correlation Matrix")
plt.show()
annot=True prints the values; center=0 makes the colormap symmetric around zero. Look for values near +1 or -1 for strong relationships.
Q: How would you customize a Matplotlib chart for a professional presentation?
(1) Use the object-oriented API with an explicit size: fig, ax = plt.subplots(figsize=(10, 6)). (2) Add labels: ax.set_title(), ax.set_xlabel(), ax.set_ylabel(). (3) Style: remove top/right spines, add a subtle grid, use brand colors. (4) Legend: ax.legend(). (5) Layout: plt.tight_layout(). (6) Save at high DPI: fig.savefig("chart.png", dpi=150, bbox_inches="tight").
Deeper coverage: Data Visualization Deep Dive
SQL-to-Pandas Translation
Q: Translate to Pandas: SELECT name, salary FROM employees WHERE department = 'Engineering' ORDER BY salary DESC LIMIT 10
(
    df[df["department"] == "Engineering"][["name", "salary"]]
    .sort_values("salary", ascending=False)
    .head(10)
)
Or with query: df.query("department == 'Engineering'")[["name","salary"]].nlargest(10, "salary")
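Both forms give the same result, which can be checked on a toy employees frame (the data is made up; top 2 instead of 10 to keep the output short):

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Ana", "Bo", "Cy", "Di"],
    "department": ["Engineering", "Engineering", "Sales", "Engineering"],
    "salary": [120, 150, 90, 110],
})

# Bracket style: filter, project, sort, limit
bracket = (
    employees[employees["department"] == "Engineering"][["name", "salary"]]
    .sort_values("salary", ascending=False)
    .head(2)
)
# query/nlargest style
shorter = employees.query("department == 'Engineering'")[["name", "salary"]].nlargest(2, "salary")

print(bracket["name"].tolist())  # ['Bo', 'Ana']
```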
Q: Translate to Pandas: SELECT department, COUNT(*), AVG(salary) FROM employees GROUP BY department HAVING COUNT(*) > 5
result = (
    df.groupby("department")
    .agg(count=("id", "count"), avg_salary=("salary", "mean"))
    .query("count > 5")
)
The .query("count > 5") is the HAVING equivalent — it filters after grouping.
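A toy run showing the HAVING behavior (three-person Eng, one-person Sales, all values invented):

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng", "Eng", "Eng", "Sales"],
    "id": [1, 2, 3, 4],
    "salary": [100, 120, 140, 90],
})

result = (
    df.groupby("department")
    .agg(count=("id", "count"), avg_salary=("salary", "mean"))
    .query("count > 2")  # HAVING COUNT(*) > 2: only Eng survives
)

print(result.index.tolist())  # ['Eng']
```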
Q: Translate to Pandas: SELECT e.name, d.dept_name FROM employees e LEFT JOIN departments d ON e.dept_id = d.id
result = pd.merge(
    employees, departments,
    left_on="dept_id", right_on="id",
    how="left"
)[["name", "dept_name"]]
Key: how="left" keeps all employees, even those without a matching department (NaN in dept_name).
Q: Translate to Pandas: SELECT *, RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank FROM employees
df["rank"] = (
    df.groupby("department")["salary"]
    .rank(method="min", ascending=False)
    .astype(int)
)
Pandas rank(method="min") mirrors SQL RANK(). Use method="dense" for DENSE_RANK() and method="first" for ROW_NUMBER().
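The three methods only disagree on ties, so a toy department with a tied salary makes the mapping concrete:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng"] * 4,
    "salary": [150, 140, 140, 120],  # one tie at 140
})
by_salary = df.groupby("department")["salary"]

df["rank"] = by_salary.rank(method="min", ascending=False).astype(int)       # RANK()
df["dense"] = by_salary.rank(method="dense", ascending=False).astype(int)    # DENSE_RANK()
df["row_num"] = by_salary.rank(method="first", ascending=False).astype(int)  # ROW_NUMBER()

print(df["rank"].tolist())     # [1, 2, 2, 4]  gap after the tie
print(df["dense"].tolist())    # [1, 2, 2, 3]  no gap
print(df["row_num"].tolist())  # [1, 2, 3, 4]  ties broken by order
```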
Full comparison table: SQL-to-Pandas Cheat Sheet
Scenario Questions
Q: You receive a messy CSV with mixed date formats, duplicate rows, inconsistent casing in category columns, and 15% missing values in the "revenue" column. Walk through how you would clean it.
# 1. Load and inspect
df = pd.read_csv("messy.csv")
print(df.shape, df.dtypes, df.isna().sum())
# 2. Fix dates
df["date"] = pd.to_datetime(df["date"], format="mixed")
# 3. Remove duplicates
df = df.drop_duplicates()
# 4. Standardize categories
df["category"] = df["category"].str.strip().str.lower()
# 5. Handle missing revenue
# Check if random or systematic first
df["revenue"] = df["revenue"].fillna(df.groupby("category")["revenue"].transform("median"))
# 6. Verify
print(df.isna().sum()) # should be 0
print(df.dtypes) # dates should be datetime64
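The groupwise fill in step 5 is the subtle part: transform("median") replaces each null with the median of its own category, not the global median. A toy check:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b"],
    "revenue": [10.0, 30.0, np.nan, 100.0, np.nan],
})

# Each null gets its own group's median: a -> 20.0, b -> 100.0
df["revenue"] = df["revenue"].fillna(
    df.groupby("category")["revenue"].transform("median")
)

print(df["revenue"].tolist())  # [10.0, 30.0, 20.0, 100.0, 100.0]
```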
Q: Your manager asks for a weekly sales report. You have a DataFrame with daily sales. How would you build this?
# Ensure date column is datetime and set as index
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
# Resample to weekly frequency
weekly = df.resample("W").agg(
    total_sales=("revenue", "sum"),
    avg_order_value=("revenue", "mean"),
    num_orders=("order_id", "count")
)
# Add week-over-week change
weekly["wow_change"] = weekly["total_sales"].pct_change() * 100
# Visualize
weekly["total_sales"].plot(kind="bar", figsize=(12, 5),
                           title="Weekly Sales")
plt.tight_layout()
plt.show()
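The resample step can be exercised on synthetic data: 14 days of constant revenue starting on a Monday, so each weekly bucket should sum to exactly 70.

```python
import pandas as pd

# 2024-01-01 is a Monday; two full Mon-Sun weeks, one 10.0-revenue order per day
dates = pd.date_range("2024-01-01", periods=14, freq="D")
df = pd.DataFrame({"revenue": [10.0] * 14, "order_id": range(14)}, index=dates)

weekly = df.resample("W").agg(
    total_sales=("revenue", "sum"),
    num_orders=("order_id", "count"),
)

print(weekly["total_sales"].tolist())  # [70.0, 70.0]
```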
Q: You notice a sudden spike in website errors last Tuesday. You have a DataFrame with columns: timestamp, error_type, endpoint, user_id. How would you investigate?
# 1. Isolate the spike day
tuesday = df[df["timestamp"].dt.date == pd.Timestamp("2026-03-24").date()]
# 2. What types of errors?
tuesday["error_type"].value_counts()
# 3. Which endpoints are affected?
tuesday.groupby(["endpoint", "error_type"]).size().sort_values(ascending=False).head(10)
# 4. Timeline — when exactly did it start?
tuesday.set_index("timestamp").resample("1h").size().plot(title="Errors per Hour")
# 5. Is it one user or many?
tuesday["user_id"].nunique() # vs total errors
# 6. Compare to a normal day
normal = df[df["timestamp"].dt.date == pd.Timestamp("2026-03-17").date()]
print(f"Tuesday: {len(tuesday)} errors, Normal: {len(normal)} errors")
Approach: quantify the spike, break down by dimension (type, endpoint, time, user), compare to baseline.
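Steps 2 and 4 in miniature, on five invented error rows:

```python
import pandas as pd

errors = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2026-03-24 09:05", "2026-03-24 09:40",
        "2026-03-24 10:10", "2026-03-24 10:15", "2026-03-24 10:50",
    ]),
    "error_type": ["500", "500", "timeout", "500", "timeout"],
})

top_types = errors["error_type"].value_counts()                 # step 2: what kinds?
per_hour = errors.set_index("timestamp").resample("1h").size()  # step 4: when?

print(top_types.index[0], per_hour.tolist())  # 500 [2, 3]
```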
Q: Your Pandas code takes 10 minutes to run on a 5 million row DataFrame. How would you optimize it?
(1) Profile first: %%timeit or line_profiler to find the slow line.
(2) Replace loops: iterrows() and apply() are the usual culprits. Rewrite with vectorized ops.
(3) Downcast types: int64 → int32, float64 → float32, object → category. Check with df.info(memory_usage="deep").
(4) Filter early: remove unneeded rows/columns before heavy operations.
(5) Use efficient methods: nlargest() instead of sort_values().head().
(6) Read selectively: usecols, nrows, or switch to Parquet (columnar, compressed).
(7) Last resort: switch to Polars (Rust-based, 5-10x faster) or Dask (parallelized Pandas).
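Point (3) is easy to demonstrate — a low-cardinality string column shrinks by an order of magnitude as category:

```python
import pandas as pd
import numpy as np

# 100k rows but only 3 distinct strings: ideal for the category dtype
rng = np.random.default_rng(0)
col = pd.Series(rng.choice(["bronze", "silver", "gold"], size=100_000))

as_object = col.memory_usage(deep=True)
as_category = col.astype("category").memory_usage(deep=True)

print(f"object: {as_object:,} B, category: {as_category:,} B")
```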
Q: You have two DataFrames: last month's product catalog and this month's product catalog. How would you find products that were added, removed, and price-changed?
# Merge with indicator to see which rows match
comparison = pd.merge(
    last_month, this_month,
    on="product_id", how="outer",
    suffixes=("_old", "_new"),
    indicator=True
)
# New products (only in this month)
added = comparison[comparison["_merge"] == "right_only"]
# Removed products (only in last month)
removed = comparison[comparison["_merge"] == "left_only"]
# Price changes (in both, but different price)
both = comparison[comparison["_merge"] == "both"]
price_changed = both[both["price_old"] != both["price_new"]].copy()  # .copy() avoids SettingWithCopyWarning on the next line
price_changed["price_diff"] = price_changed["price_new"] - price_changed["price_old"]
print(f"Added: {len(added)}, Removed: {len(removed)}, Price changed: {len(price_changed)}")
The indicator=True trick is the key — it tells you where each row came from.
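A toy run of the pattern with two three-product catalogs (ids and prices invented):

```python
import pandas as pd

last_month = pd.DataFrame({"product_id": [1, 2, 3], "price": [10, 20, 30]})
this_month = pd.DataFrame({"product_id": [2, 3, 4], "price": [20, 35, 40]})

comparison = last_month.merge(
    this_month, on="product_id", how="outer",
    suffixes=("_old", "_new"), indicator=True,
)
added = comparison[comparison["_merge"] == "right_only"]      # product 4
removed = comparison[comparison["_merge"] == "left_only"]     # product 1
both = comparison[comparison["_merge"] == "both"]
price_changed = both[both["price_old"] != both["price_new"]]  # product 3: 30 -> 35

print(len(added), len(removed), len(price_changed))  # 1 1 1
```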
Q: You are asked to analyze user retention. You have a DataFrame with user_id and login_date. How would you calculate the percentage of users who returned in their second month?
# 1. Find each user's first login month
df["login_date"] = pd.to_datetime(df["login_date"])  # ensure datetime dtype first
df["login_month"] = df["login_date"].dt.to_period("M")
first_month = df.groupby("user_id")["login_month"].min().rename("cohort")
# 2. Merge cohort back
df = df.merge(first_month.reset_index(), on="user_id")  # reset_index makes user_id a column so "on" works
# 3. Calculate months since first login
df["months_since"] = (df["login_month"] - df["cohort"]).apply(lambda x: x.n)
# 4. Count unique users per cohort per month
cohort_data = df.groupby(["cohort", "months_since"])["user_id"].nunique().unstack()
# 5. Calculate retention rate
retention = cohort_data.div(cohort_data[0], axis=0) * 100
print(f"Month-1 retention: {retention[1].mean():.1f}%")
This is a cohort analysis — group users by their signup month, then track what percentage are still active in subsequent months.
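The whole recipe on six invented login rows: three January users, two of whom log in again in February, so month-1 retention should be 66.7%.

```python
import pandas as pd

logins = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "login_date": pd.to_datetime([
        "2024-01-05", "2024-02-10",                # user 1 returns in month 1
        "2024-01-20",                              # user 2 never returns
        "2024-01-03", "2024-01-25", "2024-02-02",  # user 3 returns
    ]),
})

logins["login_month"] = logins["login_date"].dt.to_period("M")
first = logins.groupby("user_id")["login_month"].min().rename("cohort").reset_index()
logins = logins.merge(first, on="user_id")
logins["months_since"] = (logins["login_month"] - logins["cohort"]).apply(lambda x: x.n)

cohort = logins.groupby(["cohort", "months_since"])["user_id"].nunique().unstack()
retention = cohort.div(cohort[0], axis=0) * 100  # month 0 is 100% by construction

print(round(retention[1].iloc[0], 1))  # 66.7
```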