Databases & Storage

TL;DR

Choose the right database: SQL (PostgreSQL, MySQL) for structured data with relationships, NoSQL (MongoDB, DynamoDB, Redis, Cassandra) for flexible schemas and scale. Know when to use each, plus caching layers, object storage, and data lakes.

Explain Like I'm 12

Think of databases like different types of storage in your house. A SQL database is like a filing cabinet with labeled folders — everything is organized in rows and columns, and you can find things by looking up the label. A NoSQL database is more like a big box where you can throw in anything — documents, photos, random notes — and find them using a tag or a key.

A cache is like keeping your most-used things on your desk instead of walking to the filing cabinet every time. Object storage is like a warehouse — cheap and huge, perfect for storing boxes of old photos and backups.

Database Landscape

[Figure: database types overview, covering SQL databases, NoSQL types, caching layers, and storage options]

SQL vs NoSQL

This is the first question in any database decision. Both have strengths — the right choice depends on your data shape and access patterns.

Key insight: It's not "SQL or NoSQL" — most real systems use both. PostgreSQL for transactions, Redis for caching, S3 for files. The question is which is your primary data store.
| Aspect | SQL (Relational) | NoSQL (Non-Relational) |
| --- | --- | --- |
| Schema | Fixed schema (define tables and columns upfront) | Flexible / schema-less (data can vary per record) |
| Scaling | Primarily vertical (horizontal is harder) | Built for horizontal scaling (sharding is native) |
| ACID | Full ACID transactions | Varies (some offer ACID per document, most favor eventual consistency) |
| Joins | Native JOINs across tables | No joins (denormalize or do application-level joins) |
| Query language | SQL (standardized, powerful) | Varies per database (MongoDB queries, DynamoDB API, etc.) |
| Best for | Complex relationships, transactions, reporting | High-volume writes, flexible data, horizontal scale |
| Examples | PostgreSQL, MySQL, SQL Server, Oracle | MongoDB, DynamoDB, Cassandra, Redis, Neo4j |

SQL Databases

PostgreSQL

The most feature-rich open-source database. Handles both OLTP and some OLAP workloads. Supports JSON/JSONB (so you get some NoSQL flexibility), full-text search, geospatial queries (PostGIS), and advanced indexing.

  • When to use: Default choice for most applications. Complex queries, foreign keys, transactions, reporting.
  • Strengths: JSONB columns, CTEs, window functions, extensions ecosystem
  • Scale: Vertical scaling + read replicas. Can handle millions of rows per table easily.

MySQL

The most widely deployed open-source database. Simpler than PostgreSQL, optimized for read-heavy workloads. Powers WordPress, many legacy systems, and a lot of web applications.

  • When to use: Read-heavy web apps, simpler schemas, when PostgreSQL is overkill
  • Strengths: Mature replication, huge community, well-understood performance characteristics
  • Limitation: Fewer advanced query features than PostgreSQL (no JSONB equivalent; window functions and CTEs only arrived in MySQL 8.0)
ACID properties refresher: Atomicity (all or nothing), Consistency (data stays valid), Isolation (transactions don't interfere), Durability (committed data survives crashes). These are guaranteed by SQL databases and are critical for financial transactions, inventory, and user accounts.
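Atomicity is the easiest property to see in code. Below is a minimal sketch using SQLite (Python's stdlib `sqlite3`); the `accounts` table and the `transfer` helper are hypothetical, but the transaction semantics are the same idea a production database guarantees:

```python
import sqlite3

# Hypothetical accounts table to illustrate atomicity: either both the
# debit and the credit commit, or neither does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # Enforce a business rule mid-transaction
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the whole transaction was rolled back

transfer(conn, "alice", "bob", 250)  # fails: alice only has 100
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# Neither UPDATE persisted: alice still has 100, bob still has 0
```

The `with conn:` context manager is what gives "all or nothing" here: the partial debit never becomes visible because the exception rolls back both updates together.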

Indexing

Indexes are how databases make queries fast. Without an index, the database scans every row (full table scan). With an index, it jumps directly to matching rows.

  • B-tree index: Default type. Great for equality and range queries. Works on most column types.
  • Hash index: Only for exact equality lookups. Faster than B-tree for = but useless for <, >, or BETWEEN.
  • Composite index: Covers multiple columns. Follows the "leftmost prefix" rule — INDEX(a, b, c) helps queries on (a), (a,b), or (a,b,c), but NOT (b) alone.
Don't over-index. Every index speeds up reads but slows down writes (INSERT/UPDATE/DELETE must also update the index). Index columns you filter, sort, or join on frequently. Measure with EXPLAIN.
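The leftmost-prefix rule and the value of EXPLAIN can both be seen with SQLite's `EXPLAIN QUERY PLAN` (stdlib `sqlite3`). The `orders` table and index names below are illustrative:

```python
import sqlite3

# Demonstrate the leftmost-prefix rule with a composite index.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    customer_id INTEGER, status TEXT, created_at TEXT, total REAL)""")
conn.execute("CREATE INDEX idx_cust_status ON orders (customer_id, status)")

def plan(sql):
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(r[3] for r in rows)  # the 'detail' column of the plan

# Leftmost column is constrained -> the composite index is used
fast = plan("SELECT * FROM orders WHERE customer_id = 7 AND status = 'shipped'")
# Only the second column is constrained -> full table scan
slow = plan("SELECT * FROM orders WHERE status = 'shipped'")
print(fast)  # e.g. SEARCH orders USING INDEX idx_cust_status (...)
print(slow)  # e.g. SCAN orders (a full table scan)
```

The same experiment works in PostgreSQL or MySQL with their own `EXPLAIN`; the plan text differs but the index-vs-scan distinction is the same.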

Normalization vs Denormalization

| Aspect | Normalization | Denormalization |
| --- | --- | --- |
| What | Split data into related tables, no duplication | Combine data into fewer tables, accept duplication |
| Writes | Faster (update one place) | Slower (update multiple copies) |
| Reads | Slower (need JOINs) | Faster (data already together) |
| Best for | OLTP, write-heavy, data integrity | OLAP, read-heavy, reporting, caching |

NoSQL Database Types

Key insight: "NoSQL" is not one thing. It's four very different categories, each optimized for a specific data access pattern. Picking the wrong NoSQL type is worse than picking SQL.

Document Stores (MongoDB)

Store data as JSON-like documents. Each document can have a different structure. Great for content management, user profiles, product catalogs.

  • Model: Collections of documents (like JSON objects)
  • Query: Rich queries on document fields, aggregation pipeline
  • Scale: Built-in sharding on any field
  • Trade-off: No joins — denormalize or use application-side joins
  • Best for: Flexible schemas, rapid prototyping, content management systems

Key-Value Stores (Redis, DynamoDB)

Simplest model: a key maps to a value. Blazing fast because the access pattern is always "get by key."

Redis

  • In-memory: microsecond latency, used as a cache or session store
  • Data structures: strings, lists, sets, sorted sets, hashes, streams
  • Use cases: caching, session management, leaderboards, rate limiting, pub/sub
  • Limitation: the working set must fit in RAM (persistence options like RDB/AOF provide durability, not extra capacity)
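Rate limiting is a good example of the key-value pattern in practice. The sketch below shows the fixed-window approach usually built on Redis `INCR` + `EXPIRE`; a plain dict stands in for Redis so the example is self-contained, and the limit and key format are illustrative:

```python
import time

# Fixed-window rate limiter. In Redis this would be INCR on a per-window
# key with an EXPIRE; here a dict simulates the counter store.
class RateLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # key -> count (Redis would also TTL these)

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        window_start = int(now // self.window) * self.window
        key = f"rate:{user_id}:{window_start}"  # same shape as a Redis key
        count = self.counters.get(key, 0) + 1   # Redis: INCR key
        self.counters[key] = count
        return count <= self.limit

limiter = RateLimiter(limit=3, window_seconds=60)
results = [limiter.allow("u1", now=1000 + i) for i in range(5)]
# First 3 requests pass, the rest are throttled in the same window
print(results)  # [True, True, True, False, False]
```

With real Redis, the `INCR` is atomic across all application servers, which is exactly why a shared key-value store is used instead of in-process counters.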

DynamoDB

  • Managed: serverless, pay-per-request, auto-scales
  • Model: partition key + optional sort key, supports secondary indexes
  • Use cases: serverless apps, gaming leaderboards, IoT data, shopping carts
  • Limitation: query flexibility is limited — design your partition key carefully

Wide-Column Stores (Cassandra)

Optimized for massive write throughput across many nodes. Data is organized by partition key and sorted by clustering columns.

  • Scale: Linear scalability — add nodes to increase throughput
  • Writes: Extremely fast (append-only, no read-before-write)
  • Use cases: Time-series data, IoT sensor data, activity logs, messaging history
  • Trade-off: Limited query flexibility — you must design tables around your query patterns

Graph Databases (Neo4j)

Optimized for data with complex relationships. Nodes connected by edges (relationships). Queries traverse the graph.

  • Model: Nodes (entities) + Edges (relationships) with properties on both
  • Query: Cypher query language (Neo4j) — "find all friends of friends"
  • Use cases: Social networks, recommendation engines, fraud detection, knowledge graphs
  • Trade-off: Not suited for non-graph workloads, smaller community

Decision Table: Which NoSQL Type?

| Your Data Shape | Recommended Type | Examples |
| --- | --- | --- |
| JSON-like documents, flexible fields | Document store | MongoDB, Firestore |
| Simple key-value lookups, caching | Key-value store | Redis, DynamoDB, Memcached |
| Time-series, massive writes, append-only | Wide-column store | Cassandra, ScyllaDB, HBase |
| Complex relationships, traversals | Graph database | Neo4j, Amazon Neptune |
| Full-text search | Search engine | Elasticsearch, OpenSearch |

Caching Strategies

Caching is the single biggest performance win in most systems. Put frequently accessed data in fast memory (Redis, Memcached) instead of hitting the slow database on every request.

Rule of thumb: If the same database query runs more than a few times per second, it should probably be cached.

Caching Patterns

| Pattern | How it works | Best for |
| --- | --- | --- |
| Cache-aside (lazy loading) | App checks cache first. On miss, reads from DB, then writes to cache. | General purpose. Most common pattern. |
| Write-through | App writes to cache AND DB simultaneously. Cache is always up-to-date. | Read-heavy, consistency matters. |
| Write-back (write-behind) | App writes to cache only. Cache asynchronously flushes to DB. | Write-heavy workloads. Risk: data loss if the cache crashes before flushing. |
| Write-around | App writes directly to DB, bypassing cache. Cache fills on reads. | Write-heavy where written data isn't read immediately. |
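The table above can be made concrete with a cache-aside sketch. The dict-backed cache and fake database are stand-ins (in production the cache would be Redis and the DB a real store); the class and method names are illustrative:

```python
# Cache-aside: the application owns both lookups and fills the cache lazily.
class CacheAside:
    def __init__(self, db):
        self.db = db          # stand-in for a SQL database
        self.cache = {}       # stand-in for Redis
        self.hits = self.misses = 0

    def get_user(self, user_id):
        if user_id in self.cache:           # 1. check the cache first
            self.hits += 1
            return self.cache[user_id]
        self.misses += 1
        user = self.db[user_id]             # 2. on a miss, read the DB
        self.cache[user_id] = user          # 3. populate the cache
        return user

    def update_user(self, user_id, user):
        self.db[user_id] = user             # write to the DB...
        self.cache.pop(user_id, None)       # ...and invalidate, don't update

store = CacheAside(db={42: {"name": "Ada"}})
store.get_user(42)   # miss: fills the cache
store.get_user(42)   # hit: served from memory
print(store.hits, store.misses)  # 1 1
```

Note the write path invalidates rather than updates the cache entry: invalidation avoids a race where a concurrent stale read overwrites a fresher value.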

Redis vs Memcached

| Feature | Redis | Memcached |
| --- | --- | --- |
| Data types | Strings, lists, sets, hashes, sorted sets, streams | Strings only |
| Persistence | Yes (RDB snapshots, AOF) | No (pure cache) |
| Replication | Built-in (master-replica) | No native replication |
| Use case | Cache + data store (sessions, leaderboards, queues) | Simple, volatile caching |

Cache Eviction Policies

  • LRU (Least Recently Used): evict the item that hasn't been accessed the longest. Most common.
  • LFU (Least Frequently Used): evict the item accessed the fewest times. Better for skewed access patterns.
  • TTL (Time To Live): each item expires after a set duration. Simple but effective.
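An LRU cache is small enough to sketch directly. Python's `collections.OrderedDict` tracks insertion order, so recency becomes position; the capacity here is illustrative:

```python
from collections import OrderedDict

# Minimal LRU cache: move_to_end marks an entry as most recently used,
# popitem(last=False) evicts the least recently used one.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)      # touched -> most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the LRU entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # 'a' is now the most recently used
cache.put("c", 3)     # capacity exceeded -> evicts 'b'
print(list(cache.data))  # ['a', 'c']
```

Redis implements an approximate version of this (sampling a few keys per eviction) because exact LRU bookkeeping is too expensive at scale.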

Multi-Level Caching

In a real system, caching happens at multiple layers. A request passes through each layer, and any layer can serve the cached response:

Browser cache → CDN edge cache → API gateway cache → Application cache (Redis) → Database query cache → Database disk

Cache invalidation is one of the two hardest problems in computer science (along with naming things). When the underlying data changes, stale cache entries must be removed or updated. Strategies: TTL expiry, event-driven invalidation (publish cache-bust on write), versioned keys.
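The versioned-keys strategy deserves a sketch because it sidesteps deletion entirely: bump a per-entity version on write, so new reads use a fresh key and stale entries simply age out via TTL. The key format and names below are illustrative, with dicts standing in for Redis:

```python
# Versioned-key cache invalidation: writes never delete cache entries,
# they just make the old key unreachable.
cache = {}
versions = {}  # entity -> current version (in Redis: INCR user:42:ver)

def cache_key(entity_id):
    return f"user:{entity_id}:v{versions.get(entity_id, 1)}"

def read_profile(entity_id, db):
    key = cache_key(entity_id)
    if key not in cache:
        cache[key] = db[entity_id]      # miss: load from the DB and fill
    return cache[key]

def write_profile(entity_id, profile, db):
    db[entity_id] = profile
    versions[entity_id] = versions.get(entity_id, 1) + 1  # old key is now dead

db = {42: {"name": "Ada"}}
read_profile(42, db)                       # cached under user:42:v1
write_profile(42, {"name": "Grace"}, db)   # bumps the key to v2
print(read_profile(42, db))  # {'name': 'Grace'} - the stale v1 entry is never read
```

The trade-off: stale entries occupy memory until their TTL fires, in exchange for never racing a delete against a concurrent read.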

Object Storage

Object storage (S3, Google Cloud Storage, Azure Blob Storage) is for unstructured data: files, images, videos, backups, logs. It's not a database — you can't query it with SQL.

Key characteristics: Virtually unlimited storage, extremely cheap ($0.023/GB/month for S3 Standard), 11 nines of durability (99.999999999%), accessible via HTTP (REST API). No random access to file internals — you upload and download whole objects. Common use cases:
  • User uploads: profile pictures, documents, attachments
  • Static assets: CSS, JS, images served via CDN
  • Backups: database dumps, log archives
  • Data lake storage: raw data for analytics (Parquet, JSON, CSV files)
Architecture pattern: Never store files in your database (BLOBs). Store them in S3 and save the S3 URL in your database. This keeps your DB small, backups fast, and lets you serve files via CDN.
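A minimal sketch of this pattern, with a stub `ObjectStore` standing in for an S3 client (the real call would be something like boto3's `put_object`); the bucket name, key scheme, and URL format are all illustrative:

```python
import uuid

# "Files in object storage, URLs in the DB": the blob goes to the store,
# only a small URL string goes into the database row.
class ObjectStore:
    def __init__(self, bucket):
        self.bucket = bucket
        self.blobs = {}  # stand-in for the actual bucket contents

    def put(self, key, data):
        self.blobs[key] = data
        return f"https://{self.bucket}.s3.amazonaws.com/{key}"

def save_avatar(store, db, user_id, image_bytes):
    key = f"avatars/{user_id}/{uuid.uuid4()}.png"   # unique, never overwritten
    url = store.put(key, image_bytes)               # bytes -> object storage
    db[user_id] = {"avatar_url": url}               # URL only -> database
    return url

store = ObjectStore(bucket="my-app-uploads")
db = {}
url = save_avatar(store, db, 42, b"\x89PNG...")
# The DB row stays tiny; a CDN can serve the file straight from the URL
```

Using a random key per upload also gives you free cache-busting: a new avatar gets a new URL, so CDN edges never serve a stale image.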

Data Lakes & Warehouses

When you need to analyze large volumes of data (not just serve it to users), you need analytics-optimized storage.

| Aspect | Data Lake | Data Warehouse | Lakehouse |
| --- | --- | --- | --- |
| Schema | Schema-on-read (define structure at query time) | Schema-on-write (define structure at load time) | Both (open table formats) |
| Data types | Any (raw JSON, CSV, Parquet, images, logs) | Structured (tables, columns, types) | Any, with optional structure |
| Cost | Very cheap (object storage pricing) | Expensive (compute + storage bundled) | Moderate (cheap storage, pay for compute) |
| Query speed | Slow (must process raw data) | Fast (pre-optimized, columnar storage) | Fast (indexing + columnar) |
| Examples | S3 + Spark, HDFS | Snowflake, BigQuery, Redshift | Delta Lake, Apache Iceberg, Hudi |
| Best for | Raw data storage, ML training data, exploration | BI dashboards, reporting, analytics | Unifying both workloads |
The trend: The industry is moving toward the lakehouse pattern — open table formats (Delta Lake, Apache Iceberg) on top of cheap object storage (S3) with fast query engines (Spark, Trino, DuckDB). You get data lake pricing with warehouse query performance.

Database Selection Framework

When someone asks "which database should we use?" in a system design interview, walk through this decision tree:

Step 1 — Data shape: Is your data structured with relationships (users, orders, products)? Start with SQL. Is it flexible, nested, or document-like? Consider a document store. Is it just key-value lookups? Redis or DynamoDB.
| Question | If Yes | Examples |
| --- | --- | --- |
| Structured data with relationships? | SQL (PostgreSQL, MySQL) | Users, orders, inventory, banking |
| Flexible schema, document-oriented? | Document store (MongoDB) | CMS, product catalog, user profiles |
| Simple key lookups, sub-ms latency? | Key-value (Redis) | Caching, sessions, feature flags |
| Serverless, auto-scaling key-value? | DynamoDB | Shopping cart, gaming, IoT |
| Massive writes, time-series? | Wide-column (Cassandra) | Sensor data, logs, messaging |
| Complex relationships, traversals? | Graph (Neo4j) | Social graphs, fraud detection |
| Full-text search? | Elasticsearch / OpenSearch | Product search, log analysis |
| Files, images, videos? | Object storage (S3) | User uploads, backups, static assets |
| Large-scale analytics? | Data warehouse / lakehouse | BI dashboards, ML training data |
Common interview mistake: Saying "I'd use MongoDB because it's popular" without explaining why. Always justify your database choice with the data access pattern, consistency requirements, and scale needs.

Test Yourself

Q: You're building an e-commerce platform with products, orders, and user accounts. Which database would you choose as the primary data store and why?

PostgreSQL (SQL). The data is highly structured with clear relationships (users have orders, orders have items, items reference products). You need ACID transactions for payments and inventory. SQL gives you JOINs for reporting ("top-selling products this week"). Start with PostgreSQL, add Redis for caching hot product pages and sessions, and S3 for product images.

Q: What is the difference between cache-aside and write-through caching? When would you pick each?

Cache-aside: the app reads from cache; on a miss, it reads from the DB and fills the cache. Data may be stale until the next read misses. Write-through: every write goes to both cache and DB simultaneously. The cache is always up-to-date but writes are slower (two writes per operation). Use cache-aside for general-purpose caching where occasional staleness is ok. Use write-through when read consistency matters (e.g., user profile data that must be current after an update).

Q: Why can't you just use Redis as your only database?

Redis stores everything in memory, which is expensive ($$$) compared to disk-based databases. A 1TB Redis cluster costs orders of magnitude more than a 1TB PostgreSQL instance. Redis also has limited query capabilities (no JOINs, no complex filters), limited persistence guarantees (data loss risk on crashes with default config), and all data must fit in RAM. Use Redis for caching and specific use cases (sessions, leaderboards), not as a general-purpose primary database.

Q: When would you choose Cassandra over PostgreSQL?

Choose Cassandra when you have: (1) Massive write throughput needs (millions of writes/second). (2) Time-series or append-only data (sensor readings, logs, event streams). (3) Multi-datacenter deployment with eventual consistency acceptable. (4) Linear scalability needs (just add nodes). Don't choose Cassandra for: complex queries, ad-hoc analytics, data with many relationships, or when you need strong consistency and ACID transactions.

Q: What is the difference between a data lake and a data warehouse? When would you use each?

Data lake stores raw, unstructured data cheaply (schema-on-read). Use it for storing everything — logs, JSON events, images, ML training data — when you don't know how you'll use the data yet. Data warehouse stores structured, processed data with defined schemas (schema-on-write). Use it for BI dashboards, reports, and fast analytical queries. Most companies have both: raw data lands in the lake, gets cleaned/transformed, and is loaded into the warehouse for analysis.

Interview Questions

Q: Design the database schema for a Twitter-like social media platform. What databases would you use and why?

Multiple databases for different needs: (1) PostgreSQL for user accounts, relationships (follows), and metadata — structured data with ACID for account operations. (2) Cassandra for the tweet timeline — massive write throughput (millions of tweets/day), time-series access pattern, partition by user_id and sort by timestamp. (3) Redis for caching hot timelines, user sessions, and trending topics. (4) Elasticsearch for tweet search and hashtag search. (5) S3 for media (images, videos). The key insight is that no single database handles all these access patterns well.

Q: How would you handle database migrations in a system with zero downtime?

Use the expand-contract pattern: (1) Expand: add new column/table alongside the old one. Deploy code that writes to both old and new. (2) Migrate: backfill existing data from old to new. (3) Switch: deploy code that reads from new, still writes to both. (4) Contract: remove old column/table and code that writes to it. Never do destructive schema changes (dropping columns) in a single deployment. Each step is backward-compatible, so you can roll back at any point without data loss.
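The four phases can be walked through with SQLite (stdlib `sqlite3`). This sketch renames `users.fullname` to `users.display_name`; column names are illustrative, and in production each phase would be its own deployment with the backfill batched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# 1. Expand: add the new column; old readers and writers are unaffected
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# 2. Migrate: backfill existing rows (batched in production)
conn.execute("UPDATE users SET display_name = fullname "
             "WHERE display_name IS NULL")

# Dual-write phase: application code writes both columns until the switch
conn.execute("INSERT INTO users (fullname, display_name) VALUES (?, ?)",
             ("Grace Hopper", "Grace Hopper"))

# 3. Switch: new code reads display_name only
rows = conn.execute("SELECT display_name FROM users ORDER BY id").fetchall()
print(rows)  # [('Ada Lovelace',), ('Grace Hopper',)]

# 4. Contract: once nothing reads fullname, drop it in a later deployment
#    (requires SQLite >= 3.35):
# conn.execute("ALTER TABLE users DROP COLUMN fullname")
```

Because every step is additive until the final contract, a bad deploy at any phase can be rolled back to the previous one without losing writes.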

Q: Explain the concept of "database per microservice." What problems does it solve and create?

Solves: (1) Services are decoupled — each team controls its own schema. (2) Independent scaling — a write-heavy service can use Cassandra while another uses PostgreSQL. (3) Failure isolation — one database going down doesn't take down all services. Creates: (1) No cross-service JOINs — must use API calls or event-driven data sync. (2) Distributed transactions are hard — use saga pattern or eventual consistency. (3) Data duplication across services. (4) More operational overhead (more databases to manage). The trade-off is worth it at scale but overkill for small teams.

Q: Your application's database CPU is at 90% during peak hours. Walk through your optimization strategy.

Systematic approach: (1) Identify slow queries — check slow query logs, run EXPLAIN on top queries. (2) Add indexes on columns used in WHERE, JOIN, ORDER BY. (3) Add a cache layer (Redis) for the most frequent queries. (4) Optimize queries — remove SELECT *, add pagination, avoid N+1 queries. (5) Read replicas for read-heavy workloads. (6) Connection pooling (PgBouncer) to reduce connection overhead. (7) Vertical scale (bigger machine) for quick relief. (8) Denormalize hot tables for read performance. (9) Archive old data to a separate table or cold storage. Only consider sharding as a last resort.