Data Science Core Concepts

TL;DR

Data science has 6 building blocks: data collection, data wrangling, exploratory data analysis (EDA), feature engineering, modeling, and evaluation. Master these and you can tackle any data science project from start to finish.

Concept Map

These six building blocks form a pipeline. Each step feeds into the next, and you'll often loop back to earlier steps as you learn more about the data.

[Figure: Data science concept map showing the six building blocks and their relationships]

Explain Like I'm 12

Think of data science like cooking a meal. Data collection is going to the grocery store. Data wrangling is washing and chopping the ingredients. EDA is tasting as you go to understand flavors. Feature engineering is adding the right spices. Modeling is actually cooking the dish (combining everything). Evaluation is having someone taste it and tell you if it's good. If it's not great, you go back and adjust!

Cheat Sheet

| Concept | What It Does | Key Tools |
|---|---|---|
| Data Collection | Gather raw data from APIs, databases, files, web scraping | SQL, requests, BeautifulSoup, APIs |
| Data Wrangling | Clean missing values, fix types, remove duplicates, handle outliers | pandas, NumPy |
| EDA | Explore distributions, correlations, patterns visually and statistically | matplotlib, seaborn, pandas `.describe()` |
| Feature Engineering | Create new columns that help models learn better | pandas, scikit-learn transformers |
| Modeling | Train algorithms to find patterns and make predictions | scikit-learn, XGBoost, TensorFlow |
| Evaluation | Measure model performance with metrics, avoid overfitting | accuracy, precision, recall, RMSE, cross-validation |

The Building Blocks

1. Data Collection

Every project starts with getting the data. This could mean querying a SQL database, calling a REST API, reading CSV files, or scraping web pages. The key question: Is your data representative of the problem you're solving?

import pandas as pd

# From CSV
df = pd.read_csv('sales_data.csv')

# From SQL database
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://user:pass@host/db')
df = pd.read_sql('SELECT * FROM sales WHERE year = 2025', engine)

# From API
import requests
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json()['results'])
Info: In practice, data comes from many sources at once. A single project might combine SQL tables, API responses, and flat files. Joining them correctly is half the battle.
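For instance, joining API-derived product metadata onto SQL-derived sales rows might look like this (the table and column names below are invented for illustration):

```python
import pandas as pd

# Hypothetical sales rows (e.g., from SQL) and product metadata (e.g., from an API)
sales = pd.DataFrame({'product_id': [1, 2, 2, 3],
                      'amount': [10.0, 25.0, 5.0, 40.0]})
products = pd.DataFrame({'product_id': [1, 2, 3],
                         'category': ['books', 'toys', 'games']})

# Left join keeps every sale even if a product has no metadata;
# validate='many_to_one' raises if product_id is duplicated on the right.
merged = sales.merge(products, on='product_id', how='left',
                     validate='many_to_one')
print(merged)
```

The `validate` argument is cheap insurance: a silently duplicated key on the right side would multiply your rows and corrupt every downstream statistic.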

2. Data Wrangling (Cleaning)

Real-world data is messy. Expect missing values, wrong data types, duplicates, and outliers. Practitioners commonly report spending 60-80% of their time on this step. The goal is a clean, consistent dataset ready for analysis.

import pandas as pd
import numpy as np

# Check for issues
print(df.info())          # data types & non-null counts
print(df.isnull().sum())  # missing values per column

# Handle missing values
df['age'] = df['age'].fillna(df['age'].median())  # assign back; inplace on a column can fail silently
df = df.dropna(subset=['target'])                 # drop rows without target

# Fix data types
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')

# Remove duplicates
df.drop_duplicates(inplace=True)
Tip: Always check df.info() and df.describe() first. They reveal 80% of data quality issues in seconds.

3. Exploratory Data Analysis (EDA)

EDA is about asking questions and looking for answers in the data before building any model. You use statistics and visualizations to understand distributions, spot outliers, find correlations, and generate hypotheses.

import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of a numeric column
df['price'].hist(bins=50)
plt.title('Price Distribution')
plt.show()

# Correlation heatmap
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Feature Correlations')
plt.show()

# Group statistics
df.groupby('category')['revenue'].agg(['mean', 'median', 'std'])
Info: EDA isn't optional. Skipping it means you might train a model on garbage data. Garbage in, garbage out.

4. Feature Engineering

This is where domain knowledge meets programming. Feature engineering means creating new columns (features) from existing data that help the model learn better. It's often the biggest factor in model performance.

# Extract date components
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Bin continuous variables
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 65, 100],
                          labels=['teen', 'young', 'mid', 'senior', 'elderly'])

# Encode categorical variables
df = pd.get_dummies(df, columns=['category'], drop_first=True)

# Interaction features
df['price_per_sqft'] = df['price'] / df['sqft']
Tip: A simple model with great features often beats a complex model with raw features. Invest time here before reaching for fancier algorithms.

5. Modeling

Now you train an algorithm on your prepared data. The model learns patterns from training data and tries to generalize to unseen test data. Always split your data before training!

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split: 80% train, 20% test
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
Warning: Never evaluate your model on the same data you trained it on. This gives a falsely optimistic score — the model is just memorizing, not learning.
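To see the gap the warning describes, compare train and test accuracy on synthetic data. This is a sketch using an unpruned decision tree, which memorizes training data easily:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data with a little label noise
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)  # no depth limit: free to memorize
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # typically ~1.0 (memorized)
test_acc = tree.score(X_test, y_test)     # noticeably lower
print(f"train: {train_acc:.3f}, test: {test_acc:.3f}")
```

The training score looks perfect while the test score tells the real story; that gap is overfitting made visible.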

6. Evaluation

How do you know if your model is actually good? Use the right metric for your problem type and validate with cross-validation to ensure your results are stable.

from sklearn.metrics import (accuracy_score, precision_score,
                               recall_score, f1_score,
                               confusion_matrix, classification_report)
from sklearn.model_selection import cross_val_score

# Basic metrics
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.3f}")

# Full classification report
print(classification_report(y_test, y_pred))

# Cross-validation (more reliable)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
| Problem Type | Key Metrics | When to Use |
|---|---|---|
| Classification | Accuracy, Precision, Recall, F1, AUC-ROC | Predicting categories (spam/not spam) |
| Regression | MAE, MSE, RMSE, R² | Predicting numbers (house price) |
| Clustering | Silhouette score, Inertia | Grouping similar items |
Info: Accuracy is misleading with imbalanced data. If 95% of emails are not spam, a model that always says "not spam" gets 95% accuracy but catches zero spam. Use precision, recall, and F1 instead.
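You can see this for yourself with a toy label vector and an always-"not spam" prediction:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# 95% of labels are 0 ("not spam"); the "model" always predicts 0
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

# zero_division=0 avoids a warning when no positives are predicted
print(accuracy_score(y_true, y_pred))                 # 0.95 - looks great
print(recall_score(y_true, y_pred, zero_division=0))  # 0.0  - catches zero spam
print(f1_score(y_true, y_pred, zero_division=0))      # 0.0
```

Accuracy rewards the do-nothing model; recall and F1 expose it immediately.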

Test Yourself

You have a dataset with 20% missing values in the "income" column. What are two strategies to handle this?

1. Imputation: Fill missing values with the median (robust to outliers) or mean.
2. Drop rows: If only a few rows are affected, remove them.
3. Model-based: Use algorithms that handle missing values natively (like XGBoost) or predict missing values from other columns.

Choose based on how much data you can afford to lose and whether the missingness is random.
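As a concrete sketch of median imputation, here it is with plain pandas and with scikit-learn's SimpleImputer (the income values are invented):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing income values
df = pd.DataFrame({'income': [40000.0, np.nan, 55000.0, np.nan, 60000.0]})

# Plain pandas: fill with the column median
filled = df['income'].fillna(df['income'].median())

# Same idea with scikit-learn, which slots into a Pipeline later
imputer = SimpleImputer(strategy='median')
df['income_imputed'] = imputer.fit_transform(df[['income']]).ravel()
```

The SimpleImputer version is worth knowing because it learns the median on the training set only, avoiding leakage from test data when used inside a Pipeline.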

Why is cross-validation better than a single train/test split?

A single split might get lucky or unlucky depending on which rows land in test vs train. Cross-validation (e.g., 5-fold) trains and evaluates 5 times, each time using a different fold as the test set. This gives you a mean score and a standard deviation, so you know both how well the model performs and how stable that performance is.
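To make the fold mechanics concrete, KFold shows exactly which rows land in each test fold (toy data, 10 rows):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)  # 10 toy rows

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # each iteration holds out a different 2 rows as the test fold
    print(f"fold {fold}: test rows {test_idx}")
```

Every row appears in exactly one test fold, so the five scores together cover the whole dataset, which is why their mean and spread are more trustworthy than one split.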

When would you choose Recall over Precision as your primary metric?

Choose Recall when the cost of missing a positive case is high (false negatives are dangerous). Example: cancer detection — you'd rather flag healthy patients for extra tests (false positive) than miss a cancer case (false negative). Choose Precision when false positives are costly — e.g., spam filtering, where marking a real email as spam is worse than letting some spam through.

What is feature engineering and why does it matter?

Feature engineering is creating new input columns from existing data to help the model learn better. For example, extracting "day of week" from a date column, or creating "price per square foot" from price and area. It matters because models can only learn from the features you give them. A well-engineered feature can improve model performance more than switching to a fancier algorithm.

Why should you never evaluate a model on training data?

Training data evaluation measures memorization, not generalization. A complex model can score 100% on training data by memorizing every example, but fail on new data. This is called overfitting. Always evaluate on a held-out test set or use cross-validation to measure how well the model generalizes to unseen data.