Databases & Storage

TL;DR

Choose the right database: SQL (PostgreSQL, MySQL) for structured data with relationships, NoSQL (MongoDB, DynamoDB, Redis, Cassandra) for flexible schemas and scale. Know when to use each, plus caching layers, object storage, and data lakes.

Explain Like I'm 12

Think of databases like different types of storage in your house. A SQL database is like a filing cabinet with labeled folders — everything is organized in rows and columns, and you can find things by looking up the label. A NoSQL database is more like a big box where you can throw in anything — documents, photos, random notes — and find them using a tag or a key.

A cache is like keeping your most-used things on your desk instead of walking to the filing cabinet every time. Object storage is like a warehouse — cheap and huge, perfect for storing boxes of old photos and backups.

Database Landscape

[Figure: database types overview, covering SQL databases, NoSQL types, caching layers, and storage options]

SQL vs NoSQL

This is the first question in any database decision. Both have strengths — the right choice depends on your data shape and access patterns.

Key insight: It's not "SQL or NoSQL" — most real systems use both. PostgreSQL for transactions, Redis for caching, S3 for files. The question is which is your primary data store.
| Aspect | SQL (Relational) | NoSQL (Non-Relational) |
| --- | --- | --- |
| Schema | Fixed schema (define tables and columns upfront) | Flexible / schema-less (data can vary per record) |
| Scaling | Primarily vertical (horizontal is harder) | Built for horizontal scaling (sharding is native) |
| ACID | Full ACID transactions | Varies (some offer ACID per document, most favor eventual consistency) |
| Joins | Native JOINs across tables | No joins (denormalize or do application-level joins) |
| Query language | SQL (standardized, powerful) | Varies per database (MongoDB queries, DynamoDB API, etc.) |
| Best for | Complex relationships, transactions, reporting | High-volume writes, flexible data, horizontal scale |
| Examples | PostgreSQL, MySQL, SQL Server, Oracle | MongoDB, DynamoDB, Cassandra, Redis, Neo4j |

SQL Databases

PostgreSQL

The most feature-rich open-source database. Handles both OLTP and some OLAP workloads. Supports JSON/JSONB (so you get some NoSQL flexibility), full-text search, geospatial queries (PostGIS), and advanced indexing.

  • When to use: Default choice for most applications. Complex queries, foreign keys, transactions, reporting.
  • Strengths: JSONB columns, CTEs, window functions, extensions ecosystem
  • Scale: Vertical scaling + read replicas. Can handle millions of rows per table easily.

MySQL

The most widely deployed open-source database. Simpler than PostgreSQL, optimized for read-heavy workloads. Powers WordPress, many legacy systems, and a lot of web applications.

  • When to use: Read-heavy web apps, simpler schemas, when PostgreSQL is overkill
  • Strengths: Mature replication, huge community, well-understood performance characteristics
  • Limitation: Fewer advanced query features than PostgreSQL (no JSONB equivalent; window functions and CTEs only arrived in MySQL 8.0)
ACID properties refresher: Atomicity (all or nothing), Consistency (data stays valid), Isolation (transactions don't interfere), Durability (committed data survives crashes). These are guaranteed by SQL databases and are critical for financial transactions, inventory, and user accounts.
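Atomicity is the easiest property to see in code. Below is a minimal sketch using SQLite (Python's stdlib `sqlite3`); the `accounts` table and the `transfer` helper are hypothetical, but the transaction semantics are the same idea a production database guarantees:

```python
import sqlite3

# Hypothetical accounts table to illustrate atomicity: either both the
# debit and the credit commit, or neither does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # Enforce a business rule mid-transaction
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the whole transaction was rolled back

transfer(conn, "alice", "bob", 250)  # fails: alice only has 100
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# Neither UPDATE persisted: alice still has 100, bob still has 0
```

The `with conn:` context manager is what gives "all or nothing" here: the partial debit never becomes visible because the exception rolls back both updates together.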

Indexing

Indexes are how databases make queries fast. Without an index, the database scans every row (full table scan). With an index, it jumps directly to matching rows.

  • B-tree index: Default type. Great for equality and range queries. Works on most column types.
  • Hash index: Only for exact equality lookups. Faster than B-tree for = but useless for <, >, or BETWEEN.
  • Composite index: Covers multiple columns. Follows the "leftmost prefix" rule — INDEX(a, b, c) helps queries on (a), (a,b), or (a,b,c), but NOT (b) alone.
Don't over-index. Every index speeds up reads but slows down writes (INSERT/UPDATE/DELETE must also update the index). Index columns you filter, sort, or join on frequently. Measure with EXPLAIN.
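The leftmost-prefix rule and the value of EXPLAIN can both be seen with SQLite's `EXPLAIN QUERY PLAN` (stdlib `sqlite3`). The `orders` table and index names below are illustrative:

```python
import sqlite3

# Demonstrate the leftmost-prefix rule with a composite index.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    customer_id INTEGER, status TEXT, created_at TEXT, total REAL)""")
conn.execute("CREATE INDEX idx_cust_status ON orders (customer_id, status)")

def plan(sql):
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(r[3] for r in rows)  # the 'detail' column of the plan

# Leftmost column is constrained -> the composite index is used
fast = plan("SELECT * FROM orders WHERE customer_id = 7 AND status = 'shipped'")
# Only the second column is constrained -> full table scan
slow = plan("SELECT * FROM orders WHERE status = 'shipped'")
print(fast)  # e.g. SEARCH orders USING INDEX idx_cust_status (...)
print(slow)  # e.g. SCAN orders (a full table scan)
```

The same experiment works in PostgreSQL or MySQL with their own `EXPLAIN`; the plan text differs but the index-vs-scan distinction is the same.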

Normalization vs Denormalization

| Aspect | Normalization | Denormalization |
| --- | --- | --- |
| What | Split data into related tables, no duplication | Combine data into fewer tables, accept duplication |
| Writes | Faster (update one place) | Slower (update multiple copies) |
| Reads | Slower (need JOINs) | Faster (data already together) |
| Best for | OLTP, write-heavy, data integrity | OLAP, read-heavy, reporting, caching |

NoSQL Database Types

Key insight: "NoSQL" is not one thing. It's four very different categories, each optimized for a specific data access pattern. Picking the wrong NoSQL type is worse than picking SQL.

Document Stores (MongoDB)

Store data as JSON-like documents. Each document can have a different structure. Great for content management, user profiles, product catalogs.

  • Model: Collections of documents (like JSON objects)
  • Query: Rich queries on document fields, aggregation pipeline
  • Scale: Built-in sharding on any field
  • Trade-off: No joins — denormalize or use application-side joins
  • Best for: Flexible schemas, rapid prototyping, content management systems

Key-Value Stores (Redis, DynamoDB)

Simplest model: a key maps to a value. Blazing fast because the access pattern is always "get by key."

Redis

  • In-memory: microsecond latency, used as a cache or session store
  • Data structures: strings, lists, sets, sorted sets, hashes, streams
  • Use cases: caching, session management, leaderboards, rate limiting, pub/sub
  • Limitation: the working set must fit in RAM (persistence options like RDB/AOF provide durability, not extra capacity)
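Rate limiting is a good example of the key-value pattern in practice. The sketch below shows the fixed-window approach usually built on Redis `INCR` + `EXPIRE`; a plain dict stands in for Redis so the example is self-contained, and the limit and key format are illustrative:

```python
import time

# Fixed-window rate limiter. In Redis this would be INCR on a per-window
# key with an EXPIRE; here a dict simulates the counter store.
class RateLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # key -> count (Redis would also TTL these)

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        window_start = int(now // self.window) * self.window
        key = f"rate:{user_id}:{window_start}"  # same shape as a Redis key
        count = self.counters.get(key, 0) + 1   # Redis: INCR key
        self.counters[key] = count
        return count <= self.limit

limiter = RateLimiter(limit=3, window_seconds=60)
results = [limiter.allow("u1", now=1000 + i) for i in range(5)]
# First 3 requests pass, the rest are throttled in the same window
print(results)  # [True, True, True, False, False]
```

With real Redis, the `INCR` is atomic across all application servers, which is exactly why a shared key-value store is used instead of in-process counters.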

DynamoDB

  • Managed: serverless, pay-per-request, auto-scales
  • Model: partition key + optional sort key, supports secondary indexes
  • Use cases: serverless apps, gaming leaderboards, IoT data, shopping carts
  • Limitation: query flexibility is limited — design your partition key carefully

Wide-Column Stores (Cassandra)

Optimized for massive write throughput across many nodes. Data is organized by partition key and sorted by clustering columns.

  • Scale: Linear scalability — add nodes to increase throughput
  • Writes: Extremely fast (append-only, no read-before-write)
  • Use cases: Time-series data, IoT sensor data, activity logs, messaging history
  • Trade-off: Limited query flexibility — you must design tables around your query patterns

Graph Databases (Neo4j)

Optimized for data with complex relationships. Nodes connected by edges (relationships). Queries traverse the graph.

  • Model: Nodes (entities) + Edges (relationships) with properties on both
  • Query: Cypher query language (Neo4j) — "find all friends of friends"
  • Use cases: Social networks, recommendation engines, fraud detection, knowledge graphs
  • Trade-off: Not suited for non-graph workloads, smaller community

Decision Table: Which NoSQL Type?

| Your Data Shape | Recommended Type | Examples |
| --- | --- | --- |
| JSON-like documents, flexible fields | Document store | MongoDB, Firestore |
| Simple key-value lookups, caching | Key-value store | Redis, DynamoDB, Memcached |
| Time-series, massive writes, append-only | Wide-column store | Cassandra, ScyllaDB, HBase |
| Complex relationships, traversals | Graph database | Neo4j, Amazon Neptune |
| Full-text search | Search engine | Elasticsearch, OpenSearch |

Caching Strategies

Caching is the single biggest performance win in most systems. Put frequently accessed data in fast memory (Redis, Memcached) instead of hitting the slow database on every request.

Rule of thumb: If the same database query runs more than a few times per second, it should probably be cached.

Caching Patterns

| Pattern | How it works | Best for |
| --- | --- | --- |
| Cache-aside (lazy loading) | App checks cache first. On miss, reads from DB, then writes to cache. | General purpose. Most common pattern. |
| Write-through | App writes to cache AND DB simultaneously. Cache is always up-to-date. | Read-heavy, consistency matters. |
| Write-back (write-behind) | App writes to cache only. Cache asynchronously flushes to DB. | Write-heavy workloads. Risk: data loss if the cache crashes before flushing. |
| Write-around | App writes directly to DB, bypassing cache. Cache fills on reads. | Write-heavy where written data isn't read immediately. |
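The table above can be made concrete with a cache-aside sketch. The dict-backed cache and fake database are stand-ins (in production the cache would be Redis and the DB a real store); the class and method names are illustrative:

```python
# Cache-aside: the application owns both lookups and fills the cache lazily.
class CacheAside:
    def __init__(self, db):
        self.db = db          # stand-in for a SQL database
        self.cache = {}       # stand-in for Redis
        self.hits = self.misses = 0

    def get_user(self, user_id):
        if user_id in self.cache:           # 1. check the cache first
            self.hits += 1
            return self.cache[user_id]
        self.misses += 1
        user = self.db[user_id]             # 2. on a miss, read the DB
        self.cache[user_id] = user          # 3. populate the cache
        return user

    def update_user(self, user_id, user):
        self.db[user_id] = user             # write to the DB...
        self.cache.pop(user_id, None)       # ...and invalidate, don't update

store = CacheAside(db={42: {"name": "Ada"}})
store.get_user(42)   # miss: fills the cache
store.get_user(42)   # hit: served from memory
print(store.hits, store.misses)  # 1 1
```

Note the write path invalidates rather than updates the cache entry: invalidation avoids a race where a concurrent stale read overwrites a fresher value.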

Redis vs Memcached

| Feature | Redis | Memcached |
| --- | --- | --- |
| Data types | Strings, lists, sets, hashes, sorted sets, streams | Strings only |
| Persistence | Yes (RDB snapshots, AOF) | No (pure cache) |
| Replication | Built-in (master-replica) | No native replication |
| Use case | Cache + data store (sessions, leaderboards, queues) | Simple, volatile caching |

Cache Eviction Policies

  • LRU (Least Recently Used): evict the item that hasn't been accessed the longest. Most common.
  • LFU (Least Frequently Used): evict the item accessed the fewest times. Better for skewed access patterns.
  • TTL (Time To Live): each item expires after a set duration. Simple but effective.
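An LRU cache is small enough to sketch directly. Python's `collections.OrderedDict` tracks insertion order, so recency becomes position; the capacity here is illustrative:

```python
from collections import OrderedDict

# Minimal LRU cache: move_to_end marks an entry as most recently used,
# popitem(last=False) evicts the least recently used one.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)      # touched -> most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the LRU entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # 'a' is now the most recently used
cache.put("c", 3)     # capacity exceeded -> evicts 'b'
print(list(cache.data))  # ['a', 'c']
```

Redis implements an approximate version of this (sampling a few keys per eviction) because exact LRU bookkeeping is too expensive at scale.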

Multi-Level Caching

In a real system, caching happens at multiple layers. A request passes through each layer, and any layer can serve the cached response:

Browser cache → CDN edge cache → API gateway cache → Application cache (Redis) → Database query cache → Database disk

Cache invalidation is one of the two hardest problems in computer science (along with naming things). When the underlying data changes, stale cache entries must be removed or updated. Strategies: TTL expiry, event-driven invalidation (publish cache-bust on write), versioned keys.
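The versioned-keys strategy deserves a sketch because it sidesteps deletion entirely: bump a per-entity version on write, so new reads use a fresh key and stale entries simply age out via TTL. The key format and names below are illustrative, with dicts standing in for Redis:

```python
# Versioned-key cache invalidation: writes never delete cache entries,
# they just make the old key unreachable.
cache = {}
versions = {}  # entity -> current version (in Redis: INCR user:42:ver)

def cache_key(entity_id):
    return f"user:{entity_id}:v{versions.get(entity_id, 1)}"

def read_profile(entity_id, db):
    key = cache_key(entity_id)
    if key not in cache:
        cache[key] = db[entity_id]      # miss: load from the DB and fill
    return cache[key]

def write_profile(entity_id, profile, db):
    db[entity_id] = profile
    versions[entity_id] = versions.get(entity_id, 1) + 1  # old key is now dead

db = {42: {"name": "Ada"}}
read_profile(42, db)                       # cached under user:42:v1
write_profile(42, {"name": "Grace"}, db)   # bumps the key to v2
print(read_profile(42, db))  # {'name': 'Grace'} - the stale v1 entry is never read
```

The trade-off: stale entries occupy memory until their TTL fires, in exchange for never racing a delete against a concurrent read.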

Object Storage

Object storage (S3, Google Cloud Storage, Azure Blob Storage) is for unstructured data: files, images, videos, backups, logs. It's not a database — you can't query it with SQL.

Key characteristics: Virtually unlimited storage, extremely cheap ($0.023/GB/month for S3 Standard), 11 nines of durability (99.999999999%), accessible via HTTP (REST API). No random access to file internals — you upload and download whole objects. Common use cases:
  • User uploads: profile pictures, documents, attachments
  • Static assets: CSS, JS, images served via CDN
  • Backups: database dumps, log archives
  • Data lake storage: raw data for analytics (Parquet, JSON, CSV files)
Architecture pattern: Never store files in your database (BLOBs). Store them in S3 and save the S3 URL in your database. This keeps your DB small, backups fast, and lets you serve files via CDN.
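A minimal sketch of this pattern, with a stub `ObjectStore` standing in for an S3 client (the real call would be something like boto3's `put_object`); the bucket name, key scheme, and URL format are all illustrative:

```python
import uuid

# "Files in object storage, URLs in the DB": the blob goes to the store,
# only a small URL string goes into the database row.
class ObjectStore:
    def __init__(self, bucket):
        self.bucket = bucket
        self.blobs = {}  # stand-in for the actual bucket contents

    def put(self, key, data):
        self.blobs[key] = data
        return f"https://{self.bucket}.s3.amazonaws.com/{key}"

def save_avatar(store, db, user_id, image_bytes):
    key = f"avatars/{user_id}/{uuid.uuid4()}.png"   # unique, never overwritten
    url = store.put(key, image_bytes)               # bytes -> object storage
    db[user_id] = {"avatar_url": url}               # URL only -> database
    return url

store = ObjectStore(bucket="my-app-uploads")
db = {}
url = save_avatar(store, db, 42, b"\x89PNG...")
# The DB row stays tiny; a CDN can serve the file straight from the URL
```

Using a random key per upload also gives you free cache-busting: a new avatar gets a new URL, so CDN edges never serve a stale image.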

Data Lakes & Warehouses

When you need to analyze large volumes of data (not just serve it to users), you need analytics-optimized storage.

| Aspect | Data Lake | Data Warehouse | Lakehouse |
| --- | --- | --- | --- |
| Schema | Schema-on-read (define structure at query time) | Schema-on-write (define structure at load time) | Both (open table formats) |
| Data types | Any (raw JSON, CSV, Parquet, images, logs) | Structured (tables, columns, types) | Any, with optional structure |
| Cost | Very cheap (object storage pricing) | Expensive (compute + storage bundled) | Moderate (cheap storage, pay for compute) |
| Query speed | Slow (must process raw data) | Fast (pre-optimized, columnar storage) | Fast (indexing + columnar) |
| Examples | S3 + Spark, HDFS | Snowflake, BigQuery, Redshift | Delta Lake, Apache Iceberg, Hudi |
| Best for | Raw data storage, ML training data, exploration | BI dashboards, reporting, analytics | Unifying both workloads |
The trend: The industry is moving toward the lakehouse pattern — open table formats (Delta Lake, Apache Iceberg) on top of cheap object storage (S3) with fast query engines (Spark, Trino, DuckDB). You get data lake pricing with warehouse query performance.

Database Selection Framework

When someone asks "which database should we use?" in a system design interview, walk through this decision tree:

Step 1 — Data shape: Is your data structured with relationships (users, orders, products)? Start with SQL. Is it flexible, nested, or document-like? Consider a document store. Is it just key-value lookups? Redis or DynamoDB.
| Question | If Yes | Examples |
| --- | --- | --- |
| Structured data with relationships? | SQL (PostgreSQL, MySQL) | Users, orders, inventory, banking |
| Flexible schema, document-oriented? | Document store (MongoDB) | CMS, product catalog, user profiles |
| Simple key lookups, sub-ms latency? | Key-value (Redis) | Caching, sessions, feature flags |
| Serverless, auto-scaling key-value? | DynamoDB | Shopping cart, gaming, IoT |
| Massive writes, time-series? | Wide-column (Cassandra) | Sensor data, logs, messaging |
| Complex relationships, traversals? | Graph (Neo4j) | Social graphs, fraud detection |
| Full-text search? | Elasticsearch / OpenSearch | Product search, log analysis |
| Files, images, videos? | Object storage (S3) | User uploads, backups, static assets |
| Large-scale analytics? | Data warehouse / lakehouse | BI dashboards, ML training data |
Common interview mistake: Saying "I'd use MongoDB because it's popular" without explaining why. Always justify your database choice with the data access pattern, consistency requirements, and scale needs.

Test Yourself

Q: You're building an e-commerce platform with products, orders, and user accounts. Which database would you choose as the primary data store and why?

PostgreSQL (SQL). The data is highly structured with clear relationships (users have orders, orders have items, items reference products). You need ACID transactions for payments and inventory. SQL gives you JOINs for reporting ("top-selling products this week"). Start with PostgreSQL, add Redis for caching hot product pages and sessions, and S3 for product images.

Q: What is the difference between cache-aside and write-through caching? When would you pick each?

Cache-aside: the app reads from cache; on a miss, it reads from the DB and fills the cache. Data may be stale until the next read misses. Write-through: every write goes to both cache and DB simultaneously. The cache is always up-to-date but writes are slower (two writes per operation). Use cache-aside for general-purpose caching where occasional staleness is ok. Use write-through when read consistency matters (e.g., user profile data that must be current after an update).

Q: Why can't you just use Redis as your only database?

Redis stores everything in memory, which is expensive ($$$) compared to disk-based databases. A 1TB Redis cluster costs orders of magnitude more than a 1TB PostgreSQL instance. Redis also has limited query capabilities (no JOINs, no complex filters), limited persistence guarantees (data loss risk on crashes with default config), and all data must fit in RAM. Use Redis for caching and specific use cases (sessions, leaderboards), not as a general-purpose primary database.

Q: When would you choose Cassandra over PostgreSQL?

Choose Cassandra when you have: (1) Massive write throughput needs (millions of writes/second). (2) Time-series or append-only data (sensor readings, logs, event streams). (3) Multi-datacenter deployment with eventual consistency acceptable. (4) Linear scalability needs (just add nodes). Don't choose Cassandra for: complex queries, ad-hoc analytics, data with many relationships, or when you need strong consistency and ACID transactions.

Q: What is the difference between a data lake and a data warehouse? When would you use each?

Data lake stores raw, unstructured data cheaply (schema-on-read). Use it for storing everything — logs, JSON events, images, ML training data — when you don't know how you'll use the data yet. Data warehouse stores structured, processed data with defined schemas (schema-on-write). Use it for BI dashboards, reports, and fast analytical queries. Most companies have both: raw data lands in the lake, gets cleaned/transformed, and is loaded into the warehouse for analysis.

Interview Questions

Q: Design the database schema for a Twitter-like social media platform. What databases would you use and why?

Multiple databases for different needs: (1) PostgreSQL for user accounts, relationships (follows), and metadata — structured data with ACID for account operations. (2) Cassandra for the tweet timeline — massive write throughput (millions of tweets/day), time-series access pattern, partition by user_id and sort by timestamp. (3) Redis for caching hot timelines, user sessions, and trending topics. (4) Elasticsearch for tweet search and hashtag search. (5) S3 for media (images, videos). The key insight is that no single database handles all these access patterns well.

Q: How would you handle database migrations in a system with zero downtime?

Use the expand-contract pattern: (1) Expand: add new column/table alongside the old one. Deploy code that writes to both old and new. (2) Migrate: backfill existing data from old to new. (3) Switch: deploy code that reads from new, still writes to both. (4) Contract: remove old column/table and code that writes to it. Never do destructive schema changes (dropping columns) in a single deployment. Each step is backward-compatible, so you can roll back at any point without data loss.
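The four phases can be walked through with SQLite (stdlib `sqlite3`). This sketch renames `users.fullname` to `users.display_name`; column names are illustrative, and in production each phase would be its own deployment with the backfill batched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# 1. Expand: add the new column; old readers and writers are unaffected
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# 2. Migrate: backfill existing rows (batched in production)
conn.execute("UPDATE users SET display_name = fullname "
             "WHERE display_name IS NULL")

# Dual-write phase: application code writes both columns until the switch
conn.execute("INSERT INTO users (fullname, display_name) VALUES (?, ?)",
             ("Grace Hopper", "Grace Hopper"))

# 3. Switch: new code reads display_name only
rows = conn.execute("SELECT display_name FROM users ORDER BY id").fetchall()
print(rows)  # [('Ada Lovelace',), ('Grace Hopper',)]

# 4. Contract: once nothing reads fullname, drop it in a later deployment
#    (requires SQLite >= 3.35):
# conn.execute("ALTER TABLE users DROP COLUMN fullname")
```

Because every step is additive until the final contract, a bad deploy at any phase can be rolled back to the previous one without losing writes.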

Q: Explain the concept of "database per microservice." What problems does it solve and create?

Solves: (1) Services are decoupled — each team controls its own schema. (2) Independent scaling — a write-heavy service can use Cassandra while another uses PostgreSQL. (3) Failure isolation — one database going down doesn't take down all services. Creates: (1) No cross-service JOINs — must use API calls or event-driven data sync. (2) Distributed transactions are hard — use saga pattern or eventual consistency. (3) Data duplication across services. (4) More operational overhead (more databases to manage). The trade-off is worth it at scale but overkill for small teams.

Q: Your application's database CPU is at 90% during peak hours. Walk through your optimization strategy.

Systematic approach: (1) Identify slow queries — check slow query logs, run EXPLAIN on top queries. (2) Add indexes on columns used in WHERE, JOIN, ORDER BY. (3) Add a cache layer (Redis) for the most frequent queries. (4) Optimize queries — remove SELECT *, add pagination, avoid N+1 queries. (5) Read replicas for read-heavy workloads. (6) Connection pooling (PgBouncer) to reduce connection overhead. (7) Vertical scale (bigger machine) for quick relief. (8) Denormalize hot tables for read performance. (9) Archive old data to a separate table or cold storage. Only consider sharding as a last resort.