System Design Interview Questions

TL;DR

30+ system design interview questions covering fundamentals, scalability, databases, caching, real-world design problems, and estimation, each answered with a structured framework. Perfect for a focused revision session before your interview.

The framework that wins interviews: Requirements (clarify scope) → Estimation (back-of-envelope math) → API Design (key endpoints) → High-Level Design (components + data flow) → Deep Dive (scaling, trade-offs) → Trade-offs (justify decisions). Interviewers care more about your approach than the "right" answer.

Fundamentals

Q: What is a load balancer and why do you need one?

A load balancer distributes incoming traffic across multiple backend servers. You need one for: (1) Scalability — spread load across servers so no single server is overwhelmed. (2) Availability — if one server goes down, the load balancer routes traffic to healthy ones. (3) Flexibility — add/remove servers without affecting users. Common algorithms: round-robin (simple rotation), least connections (send to the least busy server), IP hash (same client always hits the same server). Layer 4 (TCP level) is faster. Layer 7 (HTTP level) can route based on URL paths, headers, or cookies. Examples: Nginx, HAProxy, AWS ALB.
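The two most common algorithms take only a few lines each. A minimal in-process sketch (single process, placeholder server names; a real balancer also tracks health checks):

```python
import itertools

class RoundRobin:
    """Simple rotation: each call returns the next server in order."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Route to whichever server currently has the fewest open connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}  # server -> open connection count

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Called when a connection closes, so counts reflect live load.
        self.active[server] -= 1

rr = RoundRobin(["app1", "app2", "app3"])
print([rr.pick() for _ in range(4)])  # → ['app1', 'app2', 'app3', 'app1']
```

Round-robin ignores how busy each server is; least connections adapts to uneven request durations, which is why it is often the default for long-lived connections.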

Q: Explain caching strategies. When should you cache, and what are the risks?

Cache when: data is read frequently, expensive to compute, and doesn't change often. Strategies: (1) Cache-aside: app checks cache; on miss, reads DB and fills cache. Most common. (2) Write-through: write to cache and DB simultaneously. Always consistent. (3) Write-back: write to cache only; flush to DB later. Fast writes but data loss risk. (4) Write-around: write to DB only; cache fills on reads. Risks: Stale data (cache doesn't reflect latest DB changes), cache stampede (many requests hit DB simultaneously when cache expires), memory pressure (cache grows too large). Mitigation: TTL for staleness, cache warming on deploy, LRU eviction for memory.
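The cache-aside flow can be sketched with plain dicts standing in for Redis and the database (`get_user` and the keys are illustrative; real code would use a Redis client and a SQL query):

```python
# Stand-ins: `db` plays the database, `cache` plays Redis.
db = {"user:1": {"name": "Ada"}}
cache = {}
stats = {"hits": 0, "misses": 0}

def get_user(key):
    if key in cache:              # 1. check the cache first
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1
    value = db.get(key)           # 2. on miss, read the database
    if value is not None:
        cache[key] = value        # 3. populate the cache for next time
    return value

get_user("user:1")   # miss: reads the DB and fills the cache
get_user("user:1")   # hit: served from the cache
print(stats)         # → {'hits': 1, 'misses': 1}
```

A production version adds a TTL on each cache entry (staleness bound) and often a lock or "probabilistic early refresh" to avoid the stampede described above.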

Q: What is a CDN? How does it differ from an origin server?

A CDN (Content Delivery Network) is a geographically distributed network of servers that caches and serves content close to users. The origin server is where the original content lives (your actual web server). How it works: user requests a file → DNS routes to nearest CDN edge server → if cached, serve immediately (cache hit). If not, fetch from origin, cache it, then serve (cache miss). Benefits: lower latency (content served from a server 50ms away instead of 200ms), reduced origin load, DDoS absorption, TLS termination at edge. Best for: static assets (images, CSS, JS, videos), API responses that don't change per-user. Examples: Cloudflare, CloudFront, Akamai, Fastly.

Q: Compare REST, GraphQL, and gRPC. When would you use each?

REST: Resource-based URLs (GET /users/123), HTTP verbs, JSON responses. Simple, widely understood, great for public APIs. Downside: over-fetching (get more data than needed) or under-fetching (need multiple requests). GraphQL: Single endpoint, client specifies exactly which fields to return. Eliminates over/under-fetching. Great for mobile apps (save bandwidth), complex UIs with many data sources. Downside: complexity, caching is harder, N+1 query risk on server. gRPC: Binary protocol (Protocol Buffers), HTTP/2, bidirectional streaming. Extremely fast, strongly typed. Great for internal microservice communication. Downside: not human-readable, limited browser support. Rule of thumb: REST for public APIs, GraphQL for client-facing APIs with complex data needs, gRPC for internal service-to-service calls.

Q: What is a message queue? When would you use one?

A message queue is a buffer between a producer (sender) and consumer (processor) that decouples them. The producer publishes a message, the queue holds it, and the consumer processes it when ready. Use cases: (1) Async processing: user uploads a video → immediately return "processing" → a worker encodes it later. (2) Decoupling services: Order service publishes "order placed" event; Inventory, Email, and Analytics services each consume it independently. (3) Load leveling: absorb traffic spikes — the queue buffers requests and workers process at a steady rate. (4) Guaranteed delivery: if a consumer crashes, the message stays in the queue for retry. Examples: Kafka (high-throughput event streaming), RabbitMQ (traditional message broker), SQS (managed, simple), Redis Streams.
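The producer/consumer decoupling can be sketched with the stdlib `queue` module standing in for a real broker like Kafka, RabbitMQ, or SQS:

```python
import queue
import threading

jobs = queue.Queue()   # the broker stand-in: buffers messages between the two sides
results = []

def worker():
    # Consumer: processes messages at its own pace.
    while True:
        msg = jobs.get()
        if msg is None:                       # sentinel: shut down
            break
        results.append(f"processed:{msg}")    # stand-in for slow work (encoding, email...)
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# Producer: publishes and returns immediately ("your upload is processing").
for i in range(3):
    jobs.put(f"video-{i}")

jobs.join()      # wait until the worker has drained the queue
jobs.put(None)
t.join()
print(results)   # → ['processed:video-0', 'processed:video-1', 'processed:video-2']
```

The producer never waits on the worker; that gap is exactly what absorbs traffic spikes in the load-leveling use case.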

Want deeper coverage? See System Design Core Concepts.

Scalability & Reliability

Q: Explain the difference between horizontal and vertical scaling. When would you choose each?

Vertical scaling (scale up): add more CPU, RAM, or storage to a single machine. Simple (no code changes), but has a ceiling (biggest machine available) and is a single point of failure. Horizontal scaling (scale out): add more machines. No ceiling, built-in redundancy, but requires stateless services, load balancing, and distributed data management. Choose vertical for databases (simpler), quick wins, and small-medium traffic. Choose horizontal for web servers, microservices, and systems that need high availability. Most production systems use both: horizontally scaled stateless app servers + vertically scaled database with read replicas.

Q: Explain the CAP theorem. Why can't you have all three?

CAP: Consistency (every read returns the latest write), Availability (every request gets a response), Partition Tolerance (system works despite network failures between nodes). You can't have all three because: during a network partition, a node must choose between (a) rejecting the request to stay consistent (CP — sacrifice availability) or (b) serving possibly stale data to stay available (AP — sacrifice consistency). Partition tolerance is mandatory in distributed systems (networks fail), so the real choice is CP vs AP. CP examples: MongoDB, HBase — used for banking, inventory. AP examples: Cassandra, DynamoDB — used for social media, product catalogs. Many systems let you tune this per-query (Cassandra consistency levels).

Q: What are the main sharding strategies? What makes a good shard key?

Strategies: (1) Hash-based: hash(shard_key) % N — distributes evenly but adding/removing shards requires rehashing (use consistent hashing to minimize data movement). (2) Range-based: shard by value ranges (A-M, N-Z) — good for range queries but can create hot spots. (3) Geographic: shard by region/country — good for data locality and compliance. Good shard key: high cardinality (many unique values), even distribution, used in most queries (avoids cross-shard queries). user_id is usually ideal. Bad shard key: low cardinality (country, status), creates hot spots, frequently used in aggregation queries that need data from all shards.
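A minimal hash-based routing sketch, using SHA-256 rather than Python's built-in `hash()` (which is salted per process and therefore unstable across servers):

```python
import hashlib

NUM_SHARDS = 4  # illustrative; real systems use consistent hashing to resize

def shard_for(user_id: str) -> int:
    """Map a shard key to a shard number, stably across processes."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard:
assert shard_for("user-42") == shard_for("user-42")
print(shard_for("user-42"), shard_for("user-43"))
```

The modulo step is why naive hash sharding reshuffles nearly everything when `NUM_SHARDS` changes; consistent hashing replaces it with a ring lookup so only neighboring ranges move.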

Q: What does "five nines" availability mean? How do you achieve it?

"Five nines" = 99.999% uptime = only 5.26 minutes of downtime per year. To achieve it: (1) Multi-region deployment with automated failover (if an entire data center goes down, traffic shifts). (2) No single points of failure — redundant load balancers, database replicas, multiple app servers. (3) Automated health checks and self-healing — Kubernetes restarts failed pods, load balancers remove unhealthy servers. (4) Zero-downtime deployments (blue-green, canary). (5) Chaos engineering — regularly inject failures to find weaknesses (Netflix Chaos Monkey). (6) On-call with < 5 minute response time. Each additional "nine" is 10x harder and more expensive. Most companies target 99.9% (three nines) or 99.99% (four nines).
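The downtime budgets fall out of simple arithmetic, which is worth being able to do on the spot:

```python
# Downtime budget per year for each availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ≈ 525,960

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: {downtime:,.2f} min/year")
# three nines ≈ 525.96 min (~8.8 hours), four ≈ 52.6 min, five ≈ 5.26 min
```

Seeing the budgets side by side makes the "each nine is 10x harder" point concrete: five nines leaves no room for a human in the recovery loop.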

Q: What is the circuit breaker pattern? How does it prevent cascading failures?

Like an electrical circuit breaker — it "trips" when a downstream service fails too many times, preventing further calls. Three states: (1) Closed (normal) — requests pass through. Track failure rate. (2) Open (tripped) — all requests fail immediately without calling the downstream service. Return cached data, a fallback, or an error. (3) Half-open (testing) — after a cooldown period, let a few requests through. If they succeed, close the circuit (recovered). If they fail, open again. Prevents cascading failures because: without a circuit breaker, Service A keeps calling failing Service B, tying up threads and connections. Those back up, causing Service A to slow down, which backs up Service C that calls Service A, and so on. The circuit breaker stops this chain by failing fast.
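A minimal sketch of the three states. The threshold and cooldown values are arbitrary illustrations, and half-open is simplified to a single trial request:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None      # None => closed; a timestamp => open

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown:
                return fallback()            # OPEN: fail fast, no downstream call
            self.opened_at = None            # HALF-OPEN: let one request try
        try:
            result = fn()
            self.failures = 0                # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time() # trip to OPEN
            return fallback()
```

Production libraries (e.g. resilience4j, Polly) refine this with rolling failure-rate windows and a configurable number of half-open trial calls, but the state machine is the same.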

Deeper coverage: Scalability & Reliability Deep Dive

Databases

Q: When would you choose SQL over NoSQL? Give specific examples.

Choose SQL when: (1) Data is structured with clear relationships (users, orders, products) — foreign keys and JOINs are essential. (2) You need ACID transactions (banking, inventory, payments). (3) Complex queries and reporting (aggregations, window functions). (4) Data integrity is critical (constraints, triggers). Examples: E-commerce (orders + items + users), banking (accounts + transactions), ERP systems. Choose NoSQL when: (1) Flexible/evolving schema (user-generated content, product catalogs with varying attributes). (2) Need massive write throughput (IoT sensor data, logs). (3) Simple access patterns (key-value lookups, document retrieval). (4) Horizontal scaling is a priority. Examples: Chat messages (Cassandra), session storage (Redis), content management (MongoDB).

Q: What are the common use cases for Redis?

Redis is an in-memory data store with rich data structures. Use cases: (1) Caching: cache DB query results, API responses, rendered HTML. (2) Session storage: store user sessions with TTL (auto-expire). (3) Rate limiting: atomic INCR + EXPIRE per user key. (4) Leaderboards: sorted sets with ZADD/ZRANGE — O(log N) insert and O(log N + M) range queries. (5) Pub/sub: real-time notifications, chat message routing. (6) Distributed locks: SETNX (set if not exists) for coordinating across services. (7) Queues: LPUSH/RPOP for simple job queues. (8) Counting: HyperLogLog for approximate unique counts (e.g., unique visitors) using only 12KB of memory. Limitation: all data must fit in RAM, which is expensive. Not a replacement for a disk-based database.
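The leaderboard use case can be sketched by mimicking sorted-set semantics with a plain dict; real code would call redis-py's `zadd` and `zrevrange` instead (names below are stand-ins, not the Redis API):

```python
# Dict stand-in for a Redis sorted set: member -> score.
scores = {}

def zadd(member, score):
    """Add or update a member's score (ZADD semantics: last write wins)."""
    scores[member] = score

def top(n):
    """Highest score first, like ZREVRANGE 0 n-1 WITHSCORES."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

zadd("alice", 3200)
zadd("bob", 4100)
zadd("carol", 2900)
print(top(2))  # → [('bob', 4100), ('alice', 3200)]
```

The dict version re-sorts on every read (O(N log N)); Redis keeps the set ordered in a skip list, which is where the O(log N) insert and O(log N + M) range-read costs come from.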

Q: What is database indexing? How do indexes speed up queries?

An index is a separate data structure (usually a B-tree) that maps column values to row locations, enabling fast lookups. Without an index, the database scans every row (full table scan, O(n)). With a B-tree index, it finds the value in O(log n). Types: (1) B-tree: balanced tree, supports equality and range queries. Default in PostgreSQL/MySQL. (2) Hash index: O(1) for exact matches but no range queries. (3) Composite index: covers multiple columns, follows "leftmost prefix" rule. (4) Covering index: includes all columns needed by a query, so the DB never reads the actual table (index-only scan). Trade-off: indexes speed up reads but slow down writes (every INSERT/UPDATE/DELETE must also update the index). Don't index everything — index columns used in WHERE, JOIN, and ORDER BY.
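The scan-vs-seek difference can be sketched with a sorted list plus `bisect` standing in for the B-tree's leaf level (the table and email format are made up for illustration):

```python
import bisect

# The "table": row_id -> email, in insertion order.
rows = [(i, f"user{i}@example.com") for i in range(100_000)]

# "CREATE INDEX ON users (email)": a sorted list of (key, row_id) pairs.
index = sorted((email, row_id) for row_id, email in rows)
keys = [k for k, _ in index]

def lookup(email):
    """O(log n) binary search in the index instead of an O(n) table scan."""
    i = bisect.bisect_left(keys, email)
    if i < len(keys) and keys[i] == email:
        return index[i][1]    # row location; the table itself is never scanned
    return None

print(lookup("user999@example.com"))  # → 999
```

Note the write cost in miniature: every INSERT into `rows` would also have to splice into `index` to keep it sorted, which is exactly why over-indexing slows writes down.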

Q: What is the difference between normalization and denormalization? When would you denormalize?

Normalization: organize data to eliminate redundancy. Split data into related tables (users, orders, products). Update in one place. Enforces data integrity. But reads require JOINs (slower for complex queries). Denormalization: intentionally add redundancy for read performance. Store computed/copied data in the same table (e.g., store customer_name directly in the orders table instead of JOINing to customers). Denormalize when: (1) Read performance is critical and JOINs are too slow. (2) Data warehousing / analytics (star schemas are denormalized by design). (3) Caching (materialized views). (4) NoSQL databases (no JOINs available, so you must denormalize). Trade-off: denormalization trades write complexity (must update multiple copies) for read speed.

Q: What is eventual consistency? Give an example where it's acceptable.

Eventual consistency: after a write, replicas may not have the latest data immediately, but they will converge to the same value eventually (usually within milliseconds to seconds). Acceptable when: (1) Social media like/comment counts — showing 1,042 likes instead of 1,043 for a few seconds is fine. (2) Product catalog prices — a 2-second delay in a price update is acceptable. (3) DNS propagation — changes take minutes/hours to propagate globally. (4) Search index updates — new content appearing in search after a slight delay is expected. Not acceptable when: bank balances, inventory counts (overselling), authentication state, medical records. These need strong consistency.

Deeper coverage: Databases & Storage Deep Dive

Design Problems

Q: Design a URL shortener like bit.ly.

Requirements: Generate short URL from long URL, redirect short → long, handle 100M URLs, 10K redirects/sec. Short code: Base62 encoding of auto-increment ID (7 chars = 3.5T combinations). Alternative: MD5 hash truncated to 7 chars with collision check. Architecture: Client → Load Balancer → App Server → Redis Cache → DynamoDB. Write flow: generate short code, store mapping in DB and cache. Read flow: check Redis first (O(1)), on miss check DynamoDB, return 301/302 redirect. Trade-offs: 301 (browser caches, lose analytics) vs 302 (always hits server, enables click tracking). Counter (no collisions, needs central ID gen) vs hash (stateless, collision risk). Mitigate the counter's single point of failure with a distributed ID generator, e.g. multiple generator instances that each pre-allocate a disjoint range of IDs, or Snowflake-style IDs.
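The Base62 step is short enough to sketch in full (the alphabet ordering below is one arbitrary but common choice):

```python
import string

# digits 0-9, then a-z, then A-Z: 62 characters total.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode(n: int) -> str:
    """Turn an auto-increment ID into a short code (repeated divmod by 62)."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def decode(code: str) -> int:
    """Invert encode(): interpret the code as a base-62 number."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode(123456789))                     # → 8m0Kx
assert decode(encode(123456789)) == 123456789
```

Because 62**7 ≈ 3.5 trillion, any ID the service will plausibly issue fits in at most 7 characters, which is where the "7 chars" requirement above comes from.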

Q: Design a real-time chat system like WhatsApp.

Requirements: 1:1 and group messaging, 50M DAU, real-time delivery, offline message handling, read receipts. Protocol: WebSocket for persistent bidirectional connection. Each server handles ~50K connections. Architecture: Client ↔ WebSocket Server ↔ Kafka ↔ Chat Service → Cassandra (messages) + Redis (user→server mapping, online status). 1:1 flow: User A sends message → WebSocket server publishes to Kafka → Chat service looks up B's server → delivers via WebSocket. If B is offline: store in Cassandra, trigger push notification. Group chat: Fan-out on write for small groups (<500). Message ordering: Snowflake-style IDs (timestamp + machine + sequence). Storage: Cassandra — partition by conversation_id, sort by timestamp. Handles massive write throughput.
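The Snowflake-style ordering IDs mentioned above come down to bit-packing: 41 bits of timestamp, 10 bits of machine ID, 12 bits of per-millisecond sequence. A simplified sketch (the epoch constant is arbitrary, and sequence-overflow handling within one millisecond is omitted):

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch to keep timestamps small

class IdGenerator:
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024        # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.sequence += 1           # same millisecond: bump sequence
            else:
                self.sequence = 0
                self.last_ms = now
            # timestamp | machine | sequence, packed into one 64-bit-ish int
            return (now << 22) | (self.machine_id << 12) | self.sequence

gen = IdGenerator(machine_id=7)
a, b = gen.next_id(), gen.next_id()
assert a < b   # IDs sort by creation time, so messages order correctly
```

Because the timestamp occupies the high bits, sorting messages by ID sorts them by time without any coordination between servers.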

Q: Design a rate limiter for an API.

Requirements: Limit requests per client per time window, return 429 when exceeded, distributed (works across servers), < 1ms overhead. Algorithm: Token bucket (allows bursts) or sliding window counter (smoother). Where: API gateway (most common) or middleware. Distributed implementation: Use Redis as central counter. For token bucket: store (tokens_remaining, last_refill_timestamp) per user. Atomic Lua script: calculate tokens to add since last refill, subtract 1 for current request, reject if < 0. For sliding window: Redis sorted set with request timestamps. ZRANGEBYSCORE to count requests in window, ZADD to add new request, ZREMRANGEBYSCORE to clean old entries. Headers: Return X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset so clients can self-throttle.
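The token-bucket arithmetic is the same whether it runs in a Redis Lua script or in process. A single-process sketch (capacity and refill rate are illustrative):

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity           # start full: allows an initial burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should respond 429 Too Many Requests

bucket = TokenBucket(capacity=3, refill_per_sec=1)
print([bucket.allow() for _ in range(5)])
# → [True, True, True, False, False]  (burst of 3 allowed, then rejected)
```

In the distributed version, `(tokens, last)` lives in Redis per user key and the refill-subtract-check sequence runs inside one Lua script so two servers can't both spend the last token.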

Q: Design a notification system that handles push, email, and SMS.

Requirements: Multi-channel delivery, user preferences, priority levels, 10M notifications/day, retry on failure. Architecture: Trigger events → Notification Service (preference lookup, template rendering, routing) → Priority Queue (Kafka/SQS with separate high/normal/low queues) → Channel Workers (Push Worker calls APNs/FCM, Email Worker calls SendGrid, SMS Worker calls Twilio). Key decisions: (1) Separate queues per priority — OTP shouldn't wait behind marketing emails. (2) Idempotency keys in Redis to prevent duplicate sends. (3) Template engine with localization. (4) Delivery tracking: created → queued → sent → delivered → read. (5) Rate limiting per user to prevent notification spam. Retry: Exponential backoff for 5xx errors. Dead-letter queue after N failures. Don't retry 4xx (permanent failures).
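The retry policy (never retry 4xx, exponential backoff on 5xx, dead-letter after N attempts) can be sketched as a pure function. Here `send` is a hypothetical provider call returning an HTTP status, and the delays are returned rather than slept so the logic is easy to inspect:

```python
def plan_delivery(send, max_attempts=4, base_delay=1.0):
    """Attempt delivery; return (final_state, backoff_delays_used)."""
    delays = []
    for attempt in range(max_attempts):
        status = send()
        if status < 500:
            # 2xx: delivered. 4xx: permanent failure, never retried.
            return ("delivered" if status < 400 else "dropped", delays)
        delays.append(base_delay * 2 ** attempt)   # 1s, 2s, 4s, 8s, ...
    return ("dead-letter", delays)                 # exhausted: park in DLQ

# Transient 500 on the first try, success on the second:
attempts = iter([500, 200])
print(plan_delivery(lambda: next(attempts)))  # → ('delivered', [1.0])
```

A production worker would also add jitter to the delays and pair the dead-letter queue with alerting, but the branch structure is the same.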

Q: Design a search autocomplete system (type-ahead suggestions).

Requirements: Show top-5 suggestions as user types, < 100ms response time, suggestions based on popularity. Data structure: Trie (prefix tree). Each node stores top-K completions for that prefix. Data pipeline: Collect search queries → aggregate by frequency (batch: daily MapReduce on search logs, or real-time: Kafka + Flink) → build/update trie. Serving: Trie loaded in memory across multiple servers. User types "app" → traverse to node "a-p-p" → return stored top-5 suggestions. Optimization: (1) Cache top 1000 prefixes (covers 80% of queries). (2) Only query server after user pauses typing (debounce 200ms). (3) Browser caches results for prefixes already fetched. Scale: Shard trie by prefix range across servers. Replicate each shard for availability.
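A minimal sketch of a trie whose nodes cache their top-K completions, so a lookup is O(len(prefix)) with no subtree traversal at query time (K and the sample phrases/counts are illustrative):

```python
K = 5  # completions cached per node

class Node:
    def __init__(self):
        self.children = {}
        self.top = []          # [(count, phrase)], best-first, capped at K

def insert(root, phrase, count):
    """Offline build step: walk the phrase, updating top-K at every node."""
    node = root
    for ch in phrase:
        node = node.children.setdefault(ch, Node())
        node.top.append((count, phrase))
        node.top.sort(reverse=True)   # highest count first
        del node.top[K:]              # keep only the top K per node

def suggest(root, prefix):
    """Serving path: walk to the prefix node and return its cached list."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [phrase for _, phrase in node.top]

root = Node()
for phrase, count in [("apple", 90), ("app store", 70), ("apply", 40)]:
    insert(root, phrase, count)
print(suggest(root, "app"))  # → ['apple', 'app store', 'apply']
```

The sort-on-insert is fine because inserts happen in the offline pipeline; the latency-critical serving path only walks the prefix and reads a precomputed list.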

Q: Design a file storage service like Google Drive.

Requirements: Upload/download files, sync across devices, sharing with permissions, version history, 50M users. Architecture: Client ↔ API Gateway → Metadata Service (PostgreSQL) + Block Service (S3). Upload: Split file into 4MB chunks, hash each chunk (SHA256), upload in parallel. Store chunk hashes + order in metadata DB. Deduplication: if a chunk hash already exists in S3, skip upload (content-addressable storage). Sync: Client maintains local file manifest with hashes. On change: compute delta (changed chunks only), upload only those chunks. Notify other devices via long polling / WebSocket. Each device pulls changed chunks. Conflict resolution: If two devices edit simultaneously, keep both versions as "conflicting copies" — let user resolve. Versioning: Each save creates a new version entry pointing to the chunk list at that point. Old chunks aren't deleted (support version rollback). Sharing: ACL per file/folder in metadata DB (owner, editor, viewer).
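The chunking-plus-dedup idea fits in a short sketch, with a dict standing in for S3 as the content-addressable blob store:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024   # 4 MB, as in the design above

def store_file(data: bytes, blob_store: dict) -> list:
    """Split into chunks, upload only chunks not already stored, and return
    the ordered chunk-hash manifest that the metadata DB would keep."""
    manifest = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        if h not in blob_store:          # dedup: identical chunk already exists
            blob_store[h] = chunk
        manifest.append(h)
    return manifest

blobs = {}
m1 = store_file(b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE, blobs)
m2 = store_file(b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE, blobs)
print(len(blobs))  # → 3 (the shared "A" chunk is stored only once)
```

Sync falls out of the same structure: comparing two manifests hash-by-hash identifies exactly which chunks changed, so a device uploads only the delta.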

Full walkthroughs: Real-World Designs Deep Dive

Estimation Questions

Back-of-envelope estimation tips: Round aggressively (1 day = ~100K seconds, 1 month = ~2.5M seconds). Show your work. State assumptions clearly. Interviewers care about your process, not exact numbers.

Q: Estimate the QPS (queries per second) for Twitter.

Assumptions: 400M monthly active users. 200M daily active users (50% DAU/MAU). Each user refreshes their feed ~5 times/day. Each refresh fetches ~20 tweets. Timeline reads: 200M * 5 = 1B timeline reads/day = 1B / 86,400 = ~12K QPS. Peak = 2-3x average = ~30K QPS. Tweet writes: 200M DAU * 10% tweet daily = 20M tweets/day = ~230 writes/second. Reads are 50x writes — read-heavy system. Fan-out: Average user has 200 followers. 20M tweets * 200 followers = 4B fan-out operations/day = ~46K fan-out writes/second. This is why the celebrity problem matters — a user with 50M followers creates 50M write operations per tweet.
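The same arithmetic, written out (every input is one of the stated assumptions):

```python
dau = 200_000_000
seconds_per_day = 86_400

reads_per_day = dau * 5                          # 5 feed refreshes per user
read_qps = reads_per_day / seconds_per_day

tweets_per_day = dau * 0.10                      # 10% of DAU tweet daily
write_qps = tweets_per_day / seconds_per_day

fanout_per_sec = tweets_per_day * 200 / seconds_per_day   # 200 followers avg

print(f"read QPS    ≈ {read_qps:,.0f}")      # ≈ 11,574 (~12K)
print(f"write QPS   ≈ {write_qps:,.0f}")     # ≈ 231
print(f"fan-out/sec ≈ {fanout_per_sec:,.0f}")  # ≈ 46,296 (~46K)
```

Laying it out like this also makes the read/write ratio (~50x) and the fan-out amplification (200x the write rate) jump out, which is the insight the interviewer is after.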

Q: Estimate the storage needed for YouTube per day.

Assumptions: 500 hours of video uploaded per minute (public stat). That's 720,000 hours/day. Average video: 1080p at ~3 GB/hour (compressed). Raw storage: 720,000 * 3 GB = ~2.16 PB/day. But YouTube stores multiple resolutions (144p, 360p, 720p, 1080p, 4K) = roughly 5x = ~10 PB/day. Plus thumbnails, metadata, captions = negligible compared to video. Annual: ~3.6 EB/year. This is why YouTube uses custom hardware and erasure coding (not simple replication) for cost efficiency. Bandwidth: If 1B daily active viewers watch 30 minutes at 720p (~1.5 GB/hour = 0.75 GB per session) = 750 PB/day in outbound bandwidth — served mostly from CDN edge caches.

Q: How many servers do you need for 1M concurrent users?

Depends on the workload: (1) Simple HTTP API: A modern server handles ~10K concurrent connections (event-driven like Node.js or Go). With 64GB RAM and good code: 1M / 10K = ~100 servers. (2) WebSocket (chat): Each connection uses ~10KB of memory. 1M * 10KB = 10GB RAM. A 64GB server can hold ~50K connections (leaving room for app logic). 1M / 50K = ~20 servers. (3) Compute-heavy (video encoding): If each request uses 1 CPU-second and you have 32-core servers: 32 concurrent requests per server. 1M concurrent / 32 = ~31,250 servers (unrealistic — you'd queue this). Key point: always specify the workload type and whether "concurrent" means "connected" or "actively processing."

Q: Estimate the bandwidth needed for a video streaming service.

Assumptions: 10M concurrent viewers. Average bitrate: 5 Mbps (1080p adaptive streaming). Total bandwidth: 10M * 5 Mbps = 50 Tbps (terabits per second). For context, a single large CDN PoP might handle 10-40 Tbps, so you need multiple PoPs worldwide. Daily data transfer: If average viewer watches 2 hours/day with 100M daily viewers: 100M * 2 hours * 2.25 GB/hour (5 Mbps) = 450 PB/day. Cost: At $0.01/GB for CDN bandwidth: 450M GB * $0.01 = $4.5M/day = $1.6B/year. This is why Netflix, YouTube, and other streaming services build their own CDN infrastructure (Netflix Open Connect) instead of relying on commercial CDNs.
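Spelled out as arithmetic, with the stated assumptions as inputs:

```python
concurrent = 10_000_000
bitrate_mbps = 5

# Concurrent bandwidth: Mbps -> Tbps (divide by 1e6).
total_tbps = concurrent * bitrate_mbps / 1_000_000

# Per-viewer data: 5 Mbps * 3600 s/hour / 8 bits-per-byte / 1000 MB-per-GB.
gb_per_hour = bitrate_mbps * 3600 / 8 / 1000       # 2.25 GB/hour

daily_viewers = 100_000_000
hours_per_viewer = 2
daily_gb = daily_viewers * hours_per_viewer * gb_per_hour
daily_pb = daily_gb / 1_000_000
daily_cost = daily_gb * 0.01                        # $0.01 per GB of CDN egress

print(f"{total_tbps:.0f} Tbps")          # → 50 Tbps
print(f"{daily_pb:.0f} PB/day")          # → 450 PB/day
print(f"${daily_cost/1e6:.1f}M/day")     # → $4.5M/day
```

The bits-to-bytes division (by 8) is the step candidates most often drop, and it changes the answer by an order of magnitude, so call it out explicitly.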

Trade-off Questions

Q: Consistency vs availability — when would you sacrifice each?

Sacrifice availability for consistency (CP): Banking systems (showing wrong balance is unacceptable), inventory management (overselling is costly), authentication (wrong auth state is a security hole), medical records. The system may temporarily refuse requests during a partition, but data is always accurate. Sacrifice consistency for availability (AP): Social media feeds (stale like count is fine), DNS (old cached record still works), product catalogs (slightly outdated price for a few seconds), search indexes (new content appearing with slight delay). The system always responds, but data might be briefly stale. Key insight: most systems aren't purely CP or AP. They use different consistency levels for different operations within the same system (strong consistency for payments, eventual consistency for feeds).

Q: Monolith vs microservices — when should you choose each?

Monolith: Single deployable unit. Choose when: small team (< 10 engineers), early-stage product (need to iterate fast), simple domain, don't yet know service boundaries. Pros: simple deployment, easy debugging, no network overhead between services, shared database. Microservices: Independent services, each with its own database. Choose when: large team (50+ engineers), well-understood domain with clear boundaries, different services need different tech stacks or scaling profiles, organizational autonomy needed. Pros: independent deployment, fault isolation, team autonomy, technology flexibility. Cons: distributed system complexity (network failures, eventual consistency), operational overhead (service discovery, tracing, monitoring), harder debugging. The path: start monolith, extract services as you grow. Don't start with microservices unless you have a strong reason.

Q: SQL vs NoSQL — what factors drive the decision?

Deciding factors: (1) Data model: Structured with relationships? SQL. Flexible/nested? Document store. Key-value? Redis/DynamoDB. (2) Consistency needs: Need ACID transactions? SQL. Eventual consistency acceptable? NoSQL scales easier. (3) Query patterns: Complex queries with JOINs and aggregations? SQL. Simple lookups by key? NoSQL. (4) Scale profile: Write-heavy at massive scale? Cassandra. Read-heavy with complex queries? PostgreSQL with read replicas. (5) Team expertise: SQL is universally known. Each NoSQL database has its own learning curve. The answer is usually "both": PostgreSQL for core transactional data + Redis for caching + S3 for files + Elasticsearch for search. Don't frame it as either/or.

Q: Push vs pull model for a news feed — what are the trade-offs?

Push (fan-out on write): When a user posts, immediately write to every follower's feed. Pros: instant feed reads (pre-computed), simple read path. Cons: celebrity problem (50M writes per post), wasted work for inactive users, high write amplification. Pull (fan-out on read): When a user opens their feed, fetch recent posts from everyone they follow. Pros: no wasted writes, handles celebrities naturally. Cons: slow read path (merge N timelines), high read latency, more compute at read time. The real answer: Hybrid. Push for normal users (< 10K followers) + pull for celebrities. This is what Twitter and Facebook actually use. It optimizes for the common case (fast reads) while handling the edge case (celebrities) gracefully.

Q: Synchronous vs asynchronous processing — when should you use each?

Synchronous: The caller waits for the result. Use when: the user needs the result immediately (login, payment confirmation, reading a profile), the operation is fast (< 200ms), the result determines the next step. Asynchronous: The caller gets an immediate acknowledgment; processing happens in the background. Use when: the operation is slow (video encoding, email sending, report generation), the user doesn't need the result immediately, you need to absorb traffic spikes (queue + workers). Implementation: message queues (Kafka, SQS, RabbitMQ). Pattern: Accept request → return 202 Accepted with a job ID → client polls for status or receives a webhook callback. Example: uploading a video is sync (fast, returns URL), processing/encoding is async (slow, notifies when done).