Data Visualization with Python

TL;DR

Create impactful charts with Matplotlib and Seaborn: bar, line, scatter, histogram, heatmap, box plots. Learn which chart type to use for which data, and how to make publication-quality visuals.

Explain Like I'm 12

Numbers in a spreadsheet are boring. Charts turn those numbers into pictures your brain can understand instantly. A bar chart tells you "Team A scored more than Team B" before you read a single number. A line chart shows whether something is going up or down over time.

Matplotlib is the basic drawing tool — you can make any chart but you have to tell it exactly how. Seaborn is like Matplotlib with built-in templates — it makes statistical charts look great with way less code.

Matplotlib Basics

Matplotlib is the foundation of Python visualization. Seaborn and the built-in Pandas plotting methods are both built on top of it. Understand Figure and Axes, and everything else clicks.

The Figure/Axes model

import matplotlib.pyplot as plt

# Method 1: Quick and simple
plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.title("Simple Line")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

# Method 2: Object-oriented (recommended for anything beyond a quick sketch)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot([1, 2, 3, 4], [10, 20, 25, 30])
ax.set_title("Simple Line")
ax.set_xlabel("X")
ax.set_ylabel("Y")
plt.show()
Always use the object-oriented API (fig, ax = plt.subplots()) for real work. The plt.plot() shortcut works for quick experiments, but the OO API gives you full control and works properly with subplots.
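
To make the Figure/Axes relationship concrete, here is a minimal sketch (the variable names are illustrative) that checks how the two objects refer to each other and how the pyplot shortcut finds the "current" Axes.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))

# The Figure is the whole canvas; each Axes is one plotting area on it.
axes_knows_figure = ax.figure is fig      # the Axes points back to its Figure
figure_lists_axes = fig.axes == [ax]      # the Figure keeps a list of its Axes

# pyplot shortcuts such as plt.plot() act on the "current" Axes,
# which here is the one that plt.subplots() just created.
current_is_ax = plt.gca() is ax

print(axes_knows_figure, figure_lists_axes, current_is_ax)
```

All three checks should come out true, which is why plt.plot() and ax.plot() draw on the same Axes in a single-chart script, and why the OO form is less ambiguous once several Axes exist.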

Chart Types with Code

Line chart — trends over time

fig, ax = plt.subplots(figsize=(8, 5))

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [12000, 15000, 13500, 18000, 21000, 19500]

ax.plot(months, revenue, marker="o", linewidth=2, color="#6366f1")
ax.set_title("Monthly Revenue 2025")
ax.set_ylabel("Revenue ($)")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
When to use: Showing how a value changes over a continuous axis (time, sequence). Best for 1-5 lines. More than that becomes unreadable.

Bar chart — comparing categories

fig, ax = plt.subplots(figsize=(8, 5))

departments = ["Eng", "Sales", "Marketing", "Product", "Support"]
headcount = [45, 30, 20, 15, 25]

bars = ax.bar(departments, headcount, color="#6366f1", edgecolor="white")
ax.set_title("Headcount by Department")
ax.set_ylabel("Employees")

# Add value labels on top of each bar
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2, height + 0.5,
            str(int(height)), ha="center", va="bottom", fontweight="bold")

plt.tight_layout()
plt.show()
When to use: Comparing discrete categories. Use horizontal bars (ax.barh()) when category names are long.
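
As a sketch of the horizontal variant mentioned above, the following uses ax.barh() with longer, hypothetical team names that would likely overlap on a vertical chart. The names and values are made up for illustration, and ax.bar_label() assumes Matplotlib 3.4 or newer.

```python
import matplotlib.pyplot as plt

# Hypothetical long category names (illustration only)
teams = ["Platform Engineering", "Customer Success", "Growth Marketing",
         "Product Management", "Technical Support"]
headcount = [45, 30, 20, 15, 25]

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.barh(teams, headcount, color="#6366f1", edgecolor="white")
ax.invert_yaxis()                  # put the first category at the top
ax.bar_label(bars, padding=3)      # value labels at the end of each bar
ax.set_title("Headcount by Team")
ax.set_xlabel("Employees")
fig.tight_layout()
```

Call plt.show() afterwards to display it. With horizontal bars the labels read left to right, so no rotation is needed.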

Scatter plot — correlations

import numpy as np

fig, ax = plt.subplots(figsize=(8, 5))

# Simulated data
np.random.seed(42)
hours_studied = np.random.uniform(1, 10, 50)
test_scores = hours_studied * 8 + np.random.normal(0, 5, 50)

ax.scatter(hours_studied, test_scores, alpha=0.7, color="#6366f1", edgecolors="white")
ax.set_title("Study Hours vs Test Score")
ax.set_xlabel("Hours Studied")
ax.set_ylabel("Test Score")
plt.tight_layout()
plt.show()
When to use: Exploring the relationship between two numeric variables. Add a trend line with np.polyfit() to show correlation direction.
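
One way to add the trend line is sketched below, assuming a straight-line (degree 1) fit is appropriate for the data. The data is simulated in the same way as the example above, but with NumPy's newer Generator API, so the exact points differ.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data (same recipe as the scatter example, different random stream)
rng = np.random.default_rng(42)
hours_studied = rng.uniform(1, 10, 50)
test_scores = hours_studied * 8 + rng.normal(0, 5, 50)

# Least-squares straight line: polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(hours_studied, test_scores, deg=1)
x_line = np.linspace(hours_studied.min(), hours_studied.max(), 100)
y_line = slope * x_line + intercept

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(hours_studied, test_scores, alpha=0.7, color="#6366f1", edgecolors="white")
ax.plot(x_line, y_line, color="#ef4444", linewidth=2,
        label=f"Trend: y = {slope:.1f}x + {intercept:.1f}")
ax.set_xlabel("Hours Studied")
ax.set_ylabel("Test Score")
ax.legend()
fig.tight_layout()
```

A positive slope indicates the two variables rise together. The line shows direction only, not how strong the relationship is.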

Histogram — distributions

fig, ax = plt.subplots(figsize=(8, 5))

salaries = np.random.normal(75000, 15000, 1000)  # simulated

ax.hist(salaries, bins=30, color="#6366f1", edgecolor="white", alpha=0.8)
ax.set_title("Salary Distribution")
ax.set_xlabel("Salary ($)")
ax.set_ylabel("Count")
ax.axvline(np.mean(salaries), color="#ef4444", linestyle="--", label=f"Mean: ${np.mean(salaries):,.0f}")
ax.legend()
plt.tight_layout()
plt.show()
When to use: Understanding the shape of a single numeric variable (normal? skewed? bimodal?). The number of bins matters — too few hides patterns, too many creates noise.
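
The effect of the bin count can be checked numerically before plotting. This sketch uses np.histogram on simulated salaries (the seed and the bin counts 5, 30, and 200 are arbitrary choices for illustration) and also asks NumPy for an automatic bin count.

```python
import numpy as np

rng = np.random.default_rng(0)
salaries = rng.normal(75000, 15000, 1000)  # simulated, as in the example above

bin_summary = {}
for bins in (5, 30, 200):
    counts, edges = np.histogram(salaries, bins=bins)
    bin_summary[bins] = {"width": edges[1] - edges[0], "tallest": counts.max()}
    print(f"{bins:>3} bins: width about ${edges[1] - edges[0]:,.0f}, "
          f"tallest bin holds {counts.max()} values")

# Let NumPy pick a bin count from the data instead of guessing
auto_bins = len(np.histogram_bin_edges(salaries, bins="auto")) - 1
print("auto:", auto_bins, "bins")
```

With very few bins each bar lumps together a wide salary range. With very many, each bar holds only a handful of values and the outline turns jagged. ax.hist() accepts the same bins="auto" string if you would rather not choose a number by hand.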

Pie chart — proportions (use with caution)

fig, ax = plt.subplots(figsize=(6, 6))

labels = ["Product A", "Product B", "Product C", "Product D"]
sizes = [40, 30, 20, 10]
colors = ["#6366f1", "#10b981", "#f59e0b", "#ef4444"]

ax.pie(sizes, labels=labels, colors=colors, autopct="%1.0f%%",
       startangle=90, wedgeprops={"edgecolor": "white", "linewidth": 2})
ax.set_title("Revenue by Product")
plt.tight_layout()
plt.show()
Pie charts are often the wrong choice. Humans are bad at comparing angles. A bar chart almost always communicates proportions more clearly. Only use pie charts when you have 3-5 categories and the exact percentages matter less than the general "big vs small" impression.

Seaborn — Statistical Visualization

Seaborn wraps Matplotlib with better defaults and built-in statistical calculations. It works directly with Pandas DataFrames.

Distribution: histplot

import seaborn as sns
import pandas as pd

# Seaborn works best with DataFrames
df = pd.DataFrame({"salary": np.random.normal(75000, 15000, 500),
                    "dept": np.random.choice(["Eng", "Sales", "Marketing"], 500)})

fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(data=df, x="salary", hue="dept", bins=25, alpha=0.6, ax=ax)
ax.set_title("Salary Distribution by Department")
plt.tight_layout()
plt.show()

Comparison: boxplot

fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=df, x="dept", y="salary", palette="Set2", ax=ax)
ax.set_title("Salary by Department")
plt.tight_layout()
plt.show()
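Reading a box plot: The box spans Q1 to Q3 (the middle 50% of the data). The line inside is the median. Each whisker extends to the most extreme data point that lies within 1.5x the interquartile range (IQR) of the box edge. Points beyond the whiskers are drawn as dots and treated as outliers.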
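
The numbers behind a box plot can be reproduced by hand. The sketch below assumes the common 1.5x IQR rule (the Matplotlib and Seaborn default) and uses simulated salaries with an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(1)
salaries = rng.normal(75000, 15000, 500)  # simulated

q1, median, q3 = np.percentile(salaries, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = salaries[(salaries >= lower_fence) & (salaries <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual dot
outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]

print(f"Box: {q1:,.0f} to {q3:,.0f}, median {median:,.0f}")
print(f"Whiskers: {whisker_low:,.0f} to {whisker_high:,.0f}")
print(f"Outliers: {len(outliers)}")
```

Note that the whiskers end at actual data points, so they are usually a little shorter than the fences themselves.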

Correlation: heatmap

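# Create a correlation matrix from the numeric columns of your DataFrame.
# The three extra columns are simulated here so the example runs; use your real ones.
df["years_exp"] = np.random.uniform(0, 20, len(df))
df["satisfaction"] = np.random.uniform(1, 5, len(df))
df["projects"] = np.random.randint(1, 15, len(df))
corr = df[["salary", "years_exp", "satisfaction", "projects"]].corr()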

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0,
            square=True, linewidths=1, ax=ax)
ax.set_title("Feature Correlation Matrix")
plt.tight_layout()
plt.show()
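Interpretation: Values near +1 mean strong positive correlation (both go up together). Values near -1 mean strong negative correlation (one goes up, the other goes down). Values near 0 mean no linear relationship, though a nonlinear one may still exist. Use center=0 so the neutral midpoint of the diverging colormap sits at zero, and add vmin=-1, vmax=1 to pin the scale to the full correlation range.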
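
To see what these values look like in practice, this sketch builds a small DataFrame whose relationships are known in advance. The column names and noise levels are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
x = rng.normal(size=200)

demo = pd.DataFrame({
    "x": x,
    "up_with_x": 2 * x + rng.normal(scale=0.5, size=200),    # rises with x
    "down_with_x": -x + rng.normal(scale=0.5, size=200),     # falls as x rises
    "unrelated": rng.normal(size=200),                       # independent noise
})

corr = demo.corr()   # Pearson correlation by default
print(corr.round(2))
```

The "x" row should show a value close to +1 for up_with_x, a clearly negative value for down_with_x, and something near 0 for unrelated. Passing this corr to sns.heatmap(corr, annot=True, cmap="coolwarm", center=0, vmin=-1, vmax=1) would color those cells accordingly.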

Explore everything: pairplot

# Scatter plots for every pair of numeric columns
# Diagonal shows the distribution of each variable
sns.pairplot(df, hue="dept", palette="Set2", height=2.5)
plt.suptitle("Pairwise Relationships", y=1.02)
plt.show()

Categorical: catplot

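# Combines box, violin, bar, swarm, etc. in one function
# "level" is simulated here; split=True needs a hue with exactly two values
df["level"] = np.random.choice(["Junior", "Senior"], len(df))
g = sns.catplot(data=df, x="dept", y="salary", hue="level",
                kind="violin", split=True, height=5, aspect=1.5)
g.ax.set_title("Salary Distribution by Department and Level")
plt.show()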

Customization

The difference between a quick plot and a presentation-ready visual is customization. Here are the essentials.

Colors, labels, and legends

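fig, ax = plt.subplots(figsize=(10, 6))

# months and revenue come from the line chart example; costs is simulated here
costs = [9000, 11000, 10500, 12500, 14000, 13800]

ax.plot(months, revenue, color="#6366f1", linewidth=2.5, label="Revenue")
ax.plot(months, costs, color="#ef4444", linewidth=2.5, linestyle="--", label="Costs")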

ax.set_title("Revenue vs Costs", fontsize=16, fontweight="bold")
ax.set_xlabel("Month", fontsize=12)
ax.set_ylabel("Amount ($)", fontsize=12)
ax.legend(loc="upper left", frameon=True, fontsize=11)
ax.grid(True, alpha=0.3)

# Remove top and right spines for a cleaner look
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

plt.tight_layout()
plt.show()

Seaborn themes

# Set a global theme (affects all subsequent plots)
sns.set_theme(style="whitegrid")    # options: white, dark, whitegrid, darkgrid, ticks
sns.set_palette("Set2")             # color palette

# Or use a context for font scaling
sns.set_context("talk")  # options: paper, notebook (default), talk, poster
Color palette guide: Use "Set2" or "tab10" for categorical data. Use "coolwarm" or "RdBu" for diverging data. Use "Blues" or "viridis" for sequential data. Avoid palettes that rely on telling red from green (such as "RdYlGn"), because they are hard to read for people with red-green color blindness.
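
For accessibility, Seaborn ships a built-in palette named "colorblind". The sketch below inspects it and then makes it the default. Whether it suits your charts is a judgment call, so treat this as one option rather than a rule.

```python
import seaborn as sns

# Look at the colors before committing to them
palette = sns.color_palette("colorblind")
hex_codes = palette.as_hex()
print(hex_codes)

# Make it the default color cycle for subsequent plots
sns.set_palette("colorblind")
```

In a Jupyter notebook, evaluating sns.color_palette("colorblind") on its own line renders the swatches inline, which is a quick way to compare candidates.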

Which Chart for Which Data?

Choosing the right chart is more important than making it pretty. Here is a decision table.

| Your Question | Data Shape | Chart Type | Code |
|---|---|---|---|
| How does X change over time? | Numeric over time | Line chart | ax.plot() |
| Which category is biggest? | Categories vs values | Bar chart | ax.bar() |
| Are X and Y related? | Two numeric variables | Scatter plot | ax.scatter() |
| What does the distribution look like? | One numeric variable | Histogram | ax.hist() / sns.histplot() |
| How do groups compare (with outliers)? | Groups of numeric values | Box plot | sns.boxplot() |
| What correlates with what? | Many numeric columns | Heatmap | sns.heatmap() |
| What fraction is each part? | 3-5 categories, proportions | Pie / Donut | ax.pie() |
| Explore all relationships at once? | Multi-column DataFrame | Pair plot | sns.pairplot() |

Subplots for Dashboards

Combine multiple charts into a single figure to tell a multi-faceted story.

# 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Top-left: line chart
axes[0, 0].plot(months, revenue, marker="o", color="#6366f1")
axes[0, 0].set_title("Monthly Revenue")

# Top-right: bar chart
axes[0, 1].bar(departments, headcount, color="#10b981")
axes[0, 1].set_title("Headcount")

# Bottom-left: histogram
axes[1, 0].hist(salaries, bins=25, color="#f59e0b", edgecolor="white")
axes[1, 0].set_title("Salary Distribution")

# Bottom-right: scatter
axes[1, 1].scatter(hours_studied, test_scores, alpha=0.6, color="#ef4444")
axes[1, 1].set_title("Study vs Score")

fig.suptitle("Company Dashboard", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.show()
Pro tip: Use plt.tight_layout() or fig.tight_layout() to avoid overlapping labels. For more control, use fig.subplots_adjust() or GridSpec.
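
As a sketch of the GridSpec option, the layout below gives one chart the full top row and splits the bottom row in two. The panel contents are small placeholders, and the 2:1 height ratio is an arbitrary choice.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 8))
gs = fig.add_gridspec(2, 2, height_ratios=[2, 1])

ax_top = fig.add_subplot(gs[0, :])    # top row, spanning both columns
ax_left = fig.add_subplot(gs[1, 0])   # bottom-left
ax_right = fig.add_subplot(gs[1, 1])  # bottom-right

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [12000, 15000, 13500, 18000, 21000, 19500]
ax_top.plot(months, revenue, marker="o", color="#6366f1")
ax_top.set_title("Monthly Revenue")

ax_left.bar(["Eng", "Sales", "Support"], [45, 30, 25], color="#10b981")
ax_left.set_title("Headcount")

ax_right.hist([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5, color="#f59e0b", edgecolor="white")
ax_right.set_title("Placeholder Distribution")

fig.suptitle("Company Dashboard", fontsize=16, fontweight="bold")
fig.tight_layout()
```

Slicing the GridSpec (gs[0, :]) is what lets a panel span several cells, something a plain plt.subplots(2, 2) grid cannot do.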

Saving Figures

# Save as PNG at 150 DPI (Matplotlib's default is 100)
fig.savefig("chart.png", dpi=150, bbox_inches="tight")

# Save as SVG (vector, scales perfectly)
fig.savefig("chart.svg", bbox_inches="tight")

# Save as PDF (great for reports)
fig.savefig("chart.pdf", bbox_inches="tight")

# Transparent background (useful for presentations)
fig.savefig("chart.png", dpi=150, transparent=True, bbox_inches="tight")
bbox_inches="tight" crops whitespace around the chart. Always include it unless you need specific margins. Use 150+ DPI for slides and on-screen use, and around 300 DPI for print.
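
To see what DPI means in pixels, this sketch renders the same figure to an in-memory PNG at several DPI values and reads the image size from the PNG header. The helper png_pixel_size is a hypothetical name introduced for this illustration. bbox_inches="tight" is left out on purpose, because cropping would change the dimensions.

```python
import io
import struct
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))   # size in inches
ax.plot([1, 2, 3], [1, 4, 9])

def png_pixel_size(figure, dpi):
    """Render the figure to an in-memory PNG and return (width, height) in pixels."""
    buffer = io.BytesIO()
    figure.savefig(buffer, format="png", dpi=dpi)
    data = buffer.getvalue()
    # PNG layout: 8-byte signature, 4-byte chunk length, b"IHDR",
    # then width and height as big-endian 32-bit integers
    return struct.unpack(">II", data[16:24])

size_100 = png_pixel_size(fig, 100)
size_150 = png_pixel_size(fig, 150)
size_300 = png_pixel_size(fig, 300)
print(size_100, size_150, size_300)
```

Pixel size is simply figsize times DPI, so an 8 x 5 inch figure at 150 DPI becomes 1200 x 750 pixels. Doubling the DPI doubles each dimension and roughly quadruples the pixel count.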

Common Mistakes

1. Unlabeled axes. Every axis needs a label. A chart without labels is just a picture.
2. Misleading Y-axis. Starting the Y-axis at a value other than zero can exaggerate differences. If the axis starts at 96, a bar for 100 is drawn twice as tall as a bar for 98, which looks like a 100% increase although the real change is about 2%. Always start at zero for bar charts.
3. Too many categories. A pie chart with 15 slices or a bar chart with 50 bars is unreadable. Group small categories into "Other" or use a different chart type.
4. Wrong chart type. A pie chart for time series data, or a line chart for unrelated categories. Refer to the decision table above.
5. No plt.tight_layout(). Labels get cut off. Always call it before show() or savefig().
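
The truncated-axis effect in item 2 can be quantified with a few lines of arithmetic. The function name apparent_vs_actual is made up for this sketch; it compares what the bar heights suggest with the real change.

```python
def apparent_vs_actual(low, high, axis_start):
    """Return (apparent, actual) percentage increase from low to high.

    "Apparent" is the increase suggested by the drawn bar heights
    when the axis begins at axis_start instead of zero.
    """
    apparent = ((high - axis_start) / (low - axis_start) - 1) * 100
    actual = (high / low - 1) * 100
    return apparent, actual

apparent, actual = apparent_vs_actual(98, 100, axis_start=96)
print(f"Axis from 96: bars suggest +{apparent:.0f}%, real change is +{actual:.1f}%")

apparent_zero, actual_zero = apparent_vs_actual(98, 100, axis_start=0)
print(f"Axis from 0:  bars suggest +{apparent_zero:.1f}%, real change is +{actual_zero:.1f}%")
```

When the axis starts at zero the two numbers agree, which is the reason behind the rule for bar charts.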

Test Yourself

Q: What is the difference between plt.plot() and the object-oriented approach fig, ax = plt.subplots()?

plt.plot() is a stateful shortcut that acts on the "current" axes. The OO approach (fig, ax) explicitly creates a Figure and Axes object, giving you full control. It scales cleanly to subplots and multiple axes, where the stateful interface gets hard to follow, and it is the recommended style for all non-trivial charts.

Q: When should you use a scatter plot vs a line chart?

Line chart: when data points are connected and ordered (e.g., temperature over months — each point follows the previous). Scatter plot: when points are independent observations (e.g., height vs weight of different people — no inherent order). Using a line chart on unconnected data implies a false trend.

Q: How do you show distributions of a variable broken down by category in Seaborn?

Several options: (1) sns.histplot(data=df, x="value", hue="category") for overlapping histograms. (2) sns.boxplot(data=df, x="category", y="value") for summary statistics with outliers. (3) sns.violinplot() for shape + density. (4) sns.kdeplot() for smooth density curves.

Q: What does sns.heatmap(corr, annot=True, center=0) do?

It draws a color-coded correlation matrix. annot=True writes the correlation value in each cell. center=0 places the midpoint of the colormap at zero, so positive and negative correlations get different colors; with a diverging colormap such as cmap="coolwarm", zero shows as a neutral shade. This makes it easy to spot strong positive and negative relationships at a glance.

Q: Name 3 things you should check before presenting a chart to stakeholders.

(1) Axis labels and title — every axis must be labeled with units. (2) Y-axis starts at zero for bar charts (to avoid misleading proportions). (3) Legend is clear — if there are multiple lines/colors, the legend must explain each. Bonus: check font size is readable, colors are accessible, and the chart answers a clear question.

Interview Questions

Q: You are given a dataset with sales by region and quarter. How would you visualize it to compare both across regions and across time?

A grouped bar chart (bars grouped by region, colored by quarter) or a heatmap (regions as rows, quarters as columns, color = sales). For trends over time per region, use a line chart with one line per region. For a single view, Seaborn's sns.catplot(data=df, kind="bar", x="quarter", y="sales", hue="region") works well.
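
A possible starting point for this answer, using only Pandas and Matplotlib: reshape the data so quarters are rows and regions are columns, then draw both a line view and a grouped-bar view. The sales figures are hypothetical and exist only to make the sketch runnable.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sales figures, for illustration only
sales = pd.DataFrame({
    "region": ["North"] * 4 + ["South"] * 4 + ["West"] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales": [120, 135, 150, 160, 90, 95, 110, 130, 60, 80, 85, 100],
})

# Rows = quarters, columns = regions
by_quarter = sales.pivot(index="quarter", columns="region", values="sales")

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(12, 5))
by_quarter.plot(ax=ax_line, marker="o")          # one line per region: trend over time
ax_line.set_title("Sales Trend by Region")
by_quarter.plot(kind="bar", ax=ax_bar, rot=0)    # bars grouped by quarter, colored by region
ax_bar.set_title("Sales by Quarter and Region")
for ax in (ax_line, ax_bar):
    ax.set_ylabel("Sales")
fig.tight_layout()
```

The same pivoted table, transposed, could also be passed to sns.heatmap() for the regions-by-quarters heatmap mentioned in the answer.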

Q: What is the difference between Matplotlib and Seaborn? When would you choose one over the other?

Matplotlib is the low-level library: maximum flexibility, more code. Seaborn is built on Matplotlib: prettier defaults, built-in statistical calculations (confidence intervals, distributions), and works directly with DataFrames. Choose Matplotlib when you need fine-grained control or non-standard charts. Choose Seaborn for statistical plots (box, violin, heatmap, pairplot) and when you want good-looking output fast.

Q: How would you customize a Matplotlib chart to match your company's brand colors and fonts?

Create a custom style: (1) Define colors as a list: brand_colors = ["#1a73e8", "#34a853", ...]. (2) Set them globally: plt.rcParams["axes.prop_cycle"] = plt.cycler(color=brand_colors). (3) Set fonts: plt.rcParams["font.family"] = "Arial". (4) Alternatively, create a .mplstyle file and load it with plt.style.use("my_brand.mplstyle"). This ensures every chart in a notebook matches the brand.
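
The rcParams approach can also be scoped to a block with plt.rc_context, so the brand style does not leak into other charts. This is a sketch: it uses only the two colors named above, and "sans-serif" stands in for a brand font that may not be installed on your machine.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

brand_colors = ["#1a73e8", "#34a853"]   # the two colors from the answer; extend with your own

brand_style = {
    "axes.prop_cycle": plt.cycler(color=brand_colors),
    "font.family": "sans-serif",        # assumption: replace with your brand font
    "axes.spines.top": False,
    "axes.spines.right": False,
}

# Settings apply only inside the with-block
with plt.rc_context(brand_style):
    fig, ax = plt.subplots(figsize=(8, 5))
    line_a, = ax.plot([1, 2, 3], [2, 4, 3], label="Series A")
    line_b, = ax.plot([1, 2, 3], [1, 3, 5], label="Series B")
    ax.set_title("Brand-Styled Chart")
    ax.legend()

print(to_hex(line_a.get_color()), to_hex(line_b.get_color()))
```

Lines pick up the brand colors in order without any color= arguments. Once the dictionary is stable, the same keys can be moved into a .mplstyle file.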

Q: A stakeholder says your chart is "misleading." What are the most common ways charts mislead, and how do you avoid them?

Common pitfalls: (1) Truncated Y-axis — starting a bar chart at 50 instead of 0 exaggerates differences. Fix: start at zero. (2) Cherry-picked time range — showing only the good months. Fix: show the full period. (3) Dual axes with different scales — implies correlation where none exists. Fix: use separate charts. (4) 3D effects — distort perception. Fix: use flat 2D charts. (5) Wrong chart type — pie chart for 15 categories. Fix: match chart to data type.