Real-World System Designs

TL;DR

Walk through designing 5 real systems: URL Shortener (hashing + key-value store), Chat App (WebSockets + message queue), News Feed (fan-out + caching), Rate Limiter (token bucket), and Notification System (queue + multi-channel delivery).

Explain Like I'm 12

Imagine you're an architect designing buildings. You don't start from scratch every time — you know that a restaurant needs a kitchen, a hospital needs an emergency room, and a school needs classrooms. System design is the same idea but for software.

A URL shortener is like giving a long address a short nickname. A chat app is like passing notes in class, but the teacher (server) makes sure they get to the right person. A news feed is like a bulletin board that magically shows each person the posts they care about most.

How to use this page: For each design, follow the framework: Requirements → Estimation → High-Level Architecture → Key Components → Database Choice → Trade-offs. This is the same framework interviewers expect.

1. URL Shortener (like bit.ly)

Why this is asked: It's deceptively simple on the surface but tests hashing, database design, caching, and scale estimation. A great warm-up problem.

Requirements

Functional: Given a long URL, generate a short URL. When someone visits the short URL, redirect to the original. Optionally: custom aliases, expiration, analytics (click count).

Non-functional: Low latency redirects (< 50ms). High availability (links should always work). Scale to 100M URLs, 10K redirects/second.

Scale Estimation

  • Write: 1 new URL/second (100M URLs over ~3 years)
  • Read: 100:1 read-to-write ratio = 100 redirects/second (peaks at 10K/s)
  • Storage: 100M URLs * 500 bytes avg = ~50 GB (small)
  • Short code length: Base62 (a-z, A-Z, 0-9) with 7 chars = 62^7 ≈ 3.5 trillion combinations (more than enough)

High-Level Architecture

Client → Load Balancer → App Server → Cache (Redis) → Database (DynamoDB or PostgreSQL)

Key Components

  • Short code generation: Use a base62-encoded counter (auto-increment ID) or MD5/SHA256 hash of the URL truncated to 7 chars. Counter approach avoids collisions entirely. Hash approach is stateless but needs collision checking.
  • Redirect flow: User hits short.url/abc1234 → app looks up abc1234 in Redis cache → on miss, checks DB → returns 301 (permanent) or 302 (temporary) redirect to the original URL.
  • Cache strategy: Cache-aside with Redis. Most short URLs follow a power-law distribution (20% of URLs get 80% of traffic). LRU eviction works well.
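The counter approach to short code generation can be sketched in a few lines. This is a minimal illustration (the alphabet ordering and function name are arbitrary choices, not a standard):

```python
# Base62 alphabet: digits, lowercase, uppercase (ordering is an arbitrary choice)
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    """Encode a non-negative integer (e.g. an auto-increment row ID) as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))  # most significant digit first
```

Because each counter value is unique, the resulting short code is collision-free by construction; padding to a fixed 7-character length is purely a presentation choice.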

Database Choice

DynamoDB is ideal — it's a key-value store and the access pattern is pure key lookup (short_code → long_url). Low latency, auto-scales, no joins needed. Alternatively, PostgreSQL with an index on short_code works fine at this scale.
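The redirect flow above reads naturally as cache-aside. A minimal in-memory sketch, with plain dicts standing in for Redis and the database (all names are illustrative):

```python
CACHE = {}  # stands in for Redis
DB = {"abc1234": "https://example.com/some/very/long/path"}  # stands in for DynamoDB/PostgreSQL

def resolve(short_code):
    """Cache-aside: try the cache, fall back to the database, populate the cache on a miss."""
    url = CACHE.get(short_code)
    if url is None:
        url = DB.get(short_code)        # cache miss: one database lookup
        if url is not None:
            CACHE[short_code] = url     # warm the cache for subsequent readers
    return url
```

With the power-law traffic described above, most calls never reach the database branch at all.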

Key Trade-offs

  • 301 vs 302 redirect: 301 (permanent) is cached by the browser, reducing server load but losing analytics. 302 (temporary) always hits your server, enabling click tracking.
  • Hash vs counter: Hash is stateless (no shared counter) but has collision risk. Counter is collision-free but needs a centralized ID generator (single point of failure — mitigate with ranges like Twitter Snowflake).

2. Real-Time Chat (like WhatsApp)

Why this is asked: Tests real-time communication (WebSockets), message delivery guarantees, offline handling, and group messaging fan-out. Very different from request-response HTTP systems.

Requirements

Functional: 1:1 messaging, group chat (up to 500 members), online status, read receipts, media sharing (images/video), message history.

Non-functional: Real-time delivery (< 200ms), messages must never be lost, support 50M daily active users, offline message delivery.

Scale Estimation

  • Concurrent connections: 50M DAU, ~10M concurrent WebSocket connections
  • Messages: 50M users * 40 messages/day = 2B messages/day = ~23K messages/second
  • Storage: 2B messages/day * 100 bytes avg = 200 GB/day = 73 TB/year

High-Level Architecture

Client → WebSocket Server → Message Queue (Kafka) → Chat Service → Message Store (Cassandra)

Presence Service tracks online/offline status. Push Notification Service handles offline users.

Key Components

  • WebSocket connections: Persistent bidirectional connection between client and server. Each server handles ~50K connections. A connection manager maps user_id → server_id so messages route to the right server.
  • Message flow (1:1): User A sends message → WebSocket server publishes to Kafka → Chat service looks up User B's connected server → delivers via WebSocket. If B is offline, push notification is triggered and message is stored for later delivery.
  • Group chat fan-out: When someone sends a message to a 200-person group, the chat service fans out to all members. For small groups (< 500), fan-out on write (push to each member). For celebrity accounts with millions of followers, use fan-out on read (see News Feed design).
  • Message ordering: Use a message_id with a timestamp component (like Snowflake IDs) so messages within a conversation are always displayed in the correct order.
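A Snowflake-style ID can be sketched as time bits, a worker ID, and a per-millisecond sequence. The bit widths below follow Twitter's published 41/10/12 layout, but the epoch constant and class name are assumptions for illustration:

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # custom epoch (an assumption); real systems pick their own

class SnowflakeLike:
    """41 bits of milliseconds, 10 bits of worker ID, 12 bits of per-millisecond sequence.
    IDs from one worker are strictly increasing, so sorting by ID sorts by send time."""

    def __init__(self, worker_id):
        self.worker_id = worker_id & 0x3FF
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # sketch: a real generator waits on overflow
            else:
                self.seq = 0
                self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.seq
```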

Database Choice

Cassandra for message storage. The access pattern is "fetch messages for conversation X, sorted by time" — perfect for Cassandra's partition key (conversation_id) + clustering key (timestamp). Handles massive write throughput. Redis for the user-to-server mapping and online status tracking.

Key Trade-offs

  • Push vs pull for delivery: Push (WebSocket) for real-time. Pull (HTTP polling) as a fallback for flaky connections. Long polling is a middle ground.
  • Message persistence: Store all messages (expensive but users expect history) vs. ephemeral messages (Snapchat model). Most chat apps store everything.
  • End-to-end encryption: Encrypted content means the server can't read messages (good for privacy), but it also means no server-side search, no spam detection, and more complex key management.

3. News Feed (like Twitter/Facebook)

The most frequently asked system design problem in interviews. The core challenge is the "celebrity problem" — how do you efficiently distribute content from someone with 50M followers?

Requirements

Functional: Users create posts. Users see a personalized feed of posts from people they follow. Feed is ranked (not just chronological). Support media (images, videos).

Non-functional: Feed loads in < 500ms. Support 500M users, 10M concurrent. Near-real-time updates (new post appears in followers' feeds within seconds).

Scale Estimation

  • Posts: 500M users * 10% post daily = 50M new posts/day = ~580 posts/second
  • Feed reads: 500M users * 5 feed refreshes/day = 2.5B reads/day = ~29K reads/second
  • Average user follows: 200 people. Max (celebrity): 50M followers.

The Core Problem: Fan-Out

When User A posts something, how do you get it into the feeds of all their followers?

  • Fan-out on write (push): when a user posts, immediately write the post to every follower's feed cache. Pros: feed reads are instant (pre-computed) and the read path is simple. Cons: the celebrity problem (50M followers = 50M writes per post) and wasted work for inactive followers.
  • Fan-out on read (pull): when a user opens their feed, fetch posts from everyone they follow and merge them. Pros: no wasted writes; works for celebrities. Cons: slow read path (merging 200+ timelines) and high read latency.
  • Hybrid (what Twitter/Facebook use): fan-out on write for normal users, fan-out on read for celebrities (> 10K followers). Pros: fast reads with no celebrity bottleneck. Cons: more complex, with two code paths to maintain.
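The hybrid decision reduces to a branch on follower count. In this sketch, plain dicts stand in for the feed cache and post store; the names and the exact placement of the 10K threshold are illustrative:

```python
CELEBRITY_THRESHOLD = 10_000

feed_cache = {}       # user_id -> pre-computed feed (stands in for Redis)
celebrity_posts = {}  # author_id -> posts merged in at read time

def publish(author_id, post_id, follower_ids):
    if len(follower_ids) > CELEBRITY_THRESHOLD:
        celebrity_posts.setdefault(author_id, []).append(post_id)   # fan-out on read
    else:
        for follower in follower_ids:
            feed_cache.setdefault(follower, []).append(post_id)     # fan-out on write

def read_feed(user_id, followed_celebrities):
    posts = list(feed_cache.get(user_id, []))
    for celeb in followed_celebrities:          # merge celebrity posts on the fly
        posts.extend(celebrity_posts.get(celeb, []))
    return posts
```

Note that publishing the celebrity post touches one list, not 20,000 follower feeds.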

High-Level Architecture

Post Service receives new posts → Fan-out Service distributes to followers' feed caches → Feed Cache (Redis) stores pre-computed feeds → Feed Service reads from cache on request, merging in celebrity posts on-the-fly.

Key Components

  • Feed cache: Redis sorted set per user, scored by post timestamp. Keep only the latest 500 posts per feed. When a user scrolls, paginate through the sorted set.
  • Ranking: Don't show purely chronological. Rank by engagement (likes, comments), recency, relationship strength (close friends rank higher), and content type. An ML model scores each candidate post.
  • Media handling: Upload images/videos to S3, generate thumbnails async, serve via CDN. Store only the media URL in the post record.
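The sorted-set pattern (score by timestamp, trim to 500, paginate newest-first) can be modeled in memory. In production these would be Redis ZADD / trim / ZREVRANGE calls; the Python below only models the behavior:

```python
import bisect

MAX_FEED = 500  # keep only the newest entries, as in the feed cache design above

def push_to_feed(feed, ts, post_id):
    """Insert keeping the feed sorted by timestamp, then drop the oldest (like ZADD + trim)."""
    bisect.insort(feed, (ts, post_id))
    if len(feed) > MAX_FEED:
        del feed[0]

def feed_page(feed, offset, limit):
    """Newest-first pagination, like ZREVRANGE offset .. offset+limit-1."""
    return [post_id for _, post_id in feed[::-1][offset:offset + limit]]
```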

Database Choice

PostgreSQL for user accounts, follow relationships. Cassandra for post storage (partition by user_id, sort by timestamp). Redis for pre-computed feed caches. Elasticsearch for post search. S3 + CDN for media.

Key Trade-offs

  • Consistency vs freshness: It's acceptable for a post to appear in feeds after a few seconds (eventual consistency). Don't sacrifice read latency for real-time updates.
  • Storage vs compute: Fan-out on write uses more storage (copies of posts in every follower's cache) but less compute at read time. The opposite for fan-out on read.

4. Rate Limiter

Why this is asked: Rate limiting is a fundamental building block in any production system. It tests your knowledge of algorithms, distributed systems (how to rate limit across multiple servers), and where to place it in the architecture.

Requirements

Functional: Limit the number of requests a client can make in a time window (e.g., 100 requests per minute per user). Return HTTP 429 (Too Many Requests) when limit is exceeded. Support different limits for different endpoints.

Non-functional: Low latency (the check should add < 1ms to each request). Distributed (works across multiple API servers). Accurate (no significant over-counting or under-counting).

Where to Place It

  • API Gateway: Most common. The gateway (Kong, AWS API Gateway, Nginx) checks rate limits before the request reaches your service. Centralized and easy to manage.
  • Middleware: Inside your application, as a middleware layer. More control but duplicated across services.
  • Service mesh sidecar: Istio/Envoy handle rate limiting at the network level. Good for microservices.

Algorithms

Token Bucket

A bucket holds tokens that refill at a fixed rate. Each request costs one token. When the bucket is empty, requests are rejected. Allows bursts (up to the bucket size) while enforcing an average rate.

  • Parameters: bucket size (max burst), refill rate (tokens per second)
  • Example: Bucket size 10, refill rate 2/sec. A user can burst 10 requests instantly, then sustain 2/sec.
  • Used by: Amazon, Stripe API
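A single-process token bucket is only a handful of lines. A minimal sketch, using lazy refill on each call and allowing fractional tokens:

```python
import time

class TokenBucket:
    """Refill at `rate` tokens/second up to `capacity`; each request spends one token."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity            # start full so an initial burst is allowed
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With capacity=10 and rate=2, a caller can burst 10 requests instantly and is then throttled to roughly 2/second, matching the example above.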

Sliding Window Log

Store the timestamp of every request. Count requests in the last N seconds. Most accurate but memory-intensive (stores every timestamp).

Sliding Window Counter

Hybrid: divide time into fixed windows and weight the previous window's count by overlap. Example: if the current window is 30% through, count = (current window count) + (previous window count * 0.7). Good balance of accuracy and memory.
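The weighted count is one line of arithmetic (the function name here is illustrative):

```python
def sliding_window_estimate(prev_count, curr_count, elapsed_fraction):
    """Estimate requests in the sliding window: the previous fixed window contributes
    in proportion to how much of it still overlaps, i.e. (1 - elapsed_fraction)."""
    return curr_count + prev_count * (1.0 - elapsed_fraction)
```

With the example above (30% through the current window), a previous count of 100 and a current count of 30 gives 30 + 100 * 0.7 = 100, so a 100/minute limit would reject the next request.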

Fixed Window Counter

Count requests in fixed time intervals (e.g., per minute starting at :00). Simple but has an edge case: a user can send 2x the limit by bursting at the boundary (end of one window + start of the next).

  • Token bucket: good accuracy, low memory (2 values per user), allows controlled bursts.
  • Sliding window log: exact accuracy, high memory (stores every timestamp), no bursts.
  • Sliding window counter: good (approximate) accuracy, low memory (2 counters per user), smoothed bursts.
  • Fixed window: has a boundary edge case, very low memory (1 counter per user), allows a 2x burst at window boundaries.

Distributed Rate Limiting

When you have multiple API servers, they all need to share the rate limit state. Use Redis as the central counter store. Each server calls Redis (INCR + EXPIRE) to atomically check and increment the counter. Redis is single-threaded, so operations are naturally serialized.

Race condition risk: Without atomic operations, two servers might both read "99 requests" and both allow the 100th — allowing 101 total. Use Redis Lua scripts or the INCR command (which is atomic) to prevent this.
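The atomic check-and-increment can be modeled without a live Redis. The lock below stands in for Redis's single-threaded execution, key expiry (EXPIRE) is elided, and all names are illustrative:

```python
import threading

class CounterStore:
    """Stand-in for Redis: incr() returns the new value atomically, like INCR."""

    def __init__(self):
        self._counts = {}
        self._lock = threading.Lock()

    def incr(self, key):
        with self._lock:    # Redis gets this serialization for free
            self._counts[key] = self._counts.get(key, 0) + 1
            return self._counts[key]

def allow_request(store, user_id, window_id, limit):
    """Fixed-window check: increment first, then compare, so two servers can never
    both observe 99 and both admit the 100th request."""
    key = f"rl:{user_id}:{window_id}"   # window_id could be epoch_seconds // 60
    return store.incr(key) <= limit
```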

Key Trade-offs

  • Hard vs soft limits: Hard limit rejects requests immediately. Soft limit allows some overflow and logs it. Use hard limits for security-sensitive endpoints (login), soft limits for general API usage.
  • Local vs global rate limiting: Local (per-server) is fast but inaccurate when traffic isn't evenly distributed. Global (Redis) is accurate but adds a network round-trip. Most systems use global with local caching.

5. Notification System

Why this is asked: Notifications span multiple delivery channels (push, SMS, email), require priority handling, retry logic, and user preference management. It's a good test of queue-based architecture.

Requirements

Functional: Send notifications via push (iOS/Android), SMS, and email. Users can set preferences (opt-in/out per channel). Priority levels (urgent, normal, low). Templates with dynamic content.

Non-functional: Deliver urgent notifications in < 1 second. Handle 10M notifications/day. Retry failed deliveries. Track delivery status (sent, delivered, read).

Scale Estimation

  • Volume: 10M notifications/day = ~115/second average, with peaks at 5-10x (morning, events)
  • Channels: 60% push, 30% email, 10% SMS (varies by app)
  • Storage: Notification log — 10M/day * 1KB = 10 GB/day (small)

High-Level Architecture

Trigger Service (events from other services) → Notification Service (orchestration, template rendering, user preferences) → Priority Queue (Kafka/SQS) → Channel Workers (Push Worker, Email Worker, SMS Worker) → Third-Party Providers (APNs, FCM, SendGrid, Twilio)

Key Components

  • Notification service: Receives events ("user X received a payment"), looks up user preferences (does X want push notifications?), renders the message template, and routes to the appropriate queue(s).
  • Priority queue: Urgent notifications (security alerts, OTPs) go to a high-priority queue that's consumed first. Normal notifications (social updates) go to a standard queue. This ensures a burst of marketing emails doesn't delay a password reset OTP.
  • Template engine: Store notification templates with placeholders: "Hi {{name}}, your order {{order_id}} has shipped!" The service fills in the variables per notification. Supports localization (language per user).
  • Channel workers: Separate worker pools for each channel (push, email, SMS). Each pulls from its queue, calls the third-party provider (APNs for iOS push, FCM for Android, SendGrid for email, Twilio for SMS), and records the delivery status.
  • Delivery tracking: Record each notification's lifecycle: created → queued → sent → delivered → read. Use this for analytics and debugging. Store in a time-series database or log store.
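Priority handling can be sketched with a heap. Real deployments more often use separate Kafka topics or SQS queues per priority, with workers draining the urgent one first; the single-process queue and level names below are illustrative:

```python
import heapq

URGENT, NORMAL, LOW = 0, 1, 2   # lower number = dequeued first

class NotificationQueue:
    """In-process priority queue; a sequence number keeps FIFO order within a level."""

    def __init__(self):
        self._heap = []
        self._seq = 0

    def enqueue(self, priority, payload):
        heapq.heappush(self._heap, (priority, self._seq, payload))
        self._seq += 1

    def dequeue(self):
        return heapq.heappop(self._heap)[2]
```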

Database Choice

PostgreSQL for user preferences and notification templates. Kafka or SQS for the message queues (Kafka for high throughput, SQS for simplicity). Cassandra or DynamoDB for the notification log (time-series, write-heavy). Redis for deduplication (prevent sending the same notification twice within a window).

Key Trade-offs

  • At-least-once vs exactly-once delivery: Exactly-once is extremely hard in distributed systems. Most notification systems use at-least-once with deduplication. A user might receive a duplicate push notification on rare occasions — that's better than missing one.
  • Batching vs immediate: Batch similar notifications ("5 people liked your photo" instead of 5 separate notifications) to avoid notification fatigue. But OTPs and security alerts must be immediate.
  • Rate limiting per user: Don't send more than X notifications per hour to a single user. Even if the system triggers 20 events, batch or suppress to prevent spam.
  • Retry strategy: When a third-party provider fails (SendGrid returns 500), retry with exponential backoff: 1s, 2s, 4s, 8s... After N failures, move to a dead-letter queue for manual investigation. Don't retry 4xx errors (invalid token, user opted out) — those are permanent failures.
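The retry policy above can be sketched as a loop. Here `send` is any hypothetical provider call returning an HTTP status, and `sleep` is injected so the backoff can be exercised without actually waiting; all names are illustrative:

```python
import time

def deliver_with_retry(send, payload, dead_letter, max_attempts=4,
                       base_delay=1.0, sleep=time.sleep):
    """Retry 5xx responses with exponential backoff (1s, 2s, 4s, ...);
    treat 4xx as permanent and dead-letter immediately."""
    delay = base_delay
    for _ in range(max_attempts):
        status = send(payload)
        if status < 400:
            return "sent"
        if status < 500:                 # invalid token, opted out: never retry
            dead_letter.append(payload)
            return "dead-letter"
        sleep(delay)
        delay *= 2
    dead_letter.append(payload)          # transient failures exhausted
    return "dead-letter"
```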

Test Yourself

Q: In a URL shortener, what happens if two users submit the same long URL? Should they get the same short code or different ones?

It depends on requirements. Same short code (dedup): saves storage, but you can't track per-user analytics or let one user delete their link without affecting the other. Different short codes: each user gets a unique link with independent analytics and lifecycle. Most production shorteners (bit.ly) create unique short codes per submission, even for the same URL, because users expect to own their links.

Q: How does the "celebrity problem" affect news feed design? What's the solution?

A celebrity with 50M followers posting triggers 50M write operations if you fan-out on write — this creates a massive spike. The solution is the hybrid approach: use fan-out on write for normal users (pre-compute their feeds) and fan-out on read for celebrities (fetch their posts at read time and merge). When a user opens their feed, their pre-computed feed is served from cache, and the system fetches recent posts from the celebrities they follow and merges them in real-time. The threshold is typically ~10K followers.

Q: Why is the token bucket algorithm preferred over fixed window for rate limiting?

Fixed window has a boundary problem: a user can send 100 requests at 11:59:59 and another 100 at 12:00:00, effectively getting 200 requests in 2 seconds while the "limit" is 100/minute. Token bucket avoids this because it tracks a continuous refill rate. It also allows controlled bursts (send 10 requests instantly) while enforcing an average rate over time, which matches real user behavior better.

Q: In a chat system, how do you ensure messages are delivered to a user who is currently offline?

Store the message in the database (Cassandra) regardless of whether the recipient is online. If the recipient is online, also push via WebSocket. If offline, queue a push notification (APNs/FCM). When the user comes back online, the client pulls all undelivered messages since their last seen timestamp. The server marks messages as delivered. This guarantees no message is ever lost, even if the push notification fails to reach the device.

Q: How would you handle notification deduplication? (Prevent sending the same alert twice)

Use an idempotency key for each notification event. Before processing, check if this key exists in Redis (with a TTL of the dedup window, e.g., 1 hour). If it exists, skip. If not, add the key and process the notification. The key can be a hash of (user_id + event_type + entity_id + time_window). This prevents retries and duplicate events from causing multiple notifications. Store the key in Redis for fast O(1) lookup.
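The dedup check maps to Redis `SET key NX EX ttl`. An in-memory sketch, where a dict stands in for Redis and `now` is injectable so the TTL behavior is visible:

```python
import hashlib
import time

def dedup_key(user_id, event_type, entity_id, window):
    """Idempotency key: a hash of the identifying fields plus the time window."""
    raw = f"{user_id}:{event_type}:{entity_id}:{window}"
    return hashlib.sha256(raw.encode()).hexdigest()

class Deduper:
    """Remembers keys for `ttl` seconds, like Redis SET NX EX."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._expiry = {}    # key -> expiry timestamp

    def first_time(self, key, now=None):
        now = time.time() if now is None else now
        expiry = self._expiry.get(key)
        if expiry is not None and expiry > now:
            return False                 # seen within the window: suppress
        self._expiry[key] = now + self.ttl
        return True
```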

Interview Questions

Q: Design a search autocomplete system (like Google search suggestions). What are the key components?

Key components: (1) Trie data structure (prefix tree) stores popular search queries. Each node is a character; traversing from root to a node gives a prefix. Each node stores the top-K completions for that prefix. (2) Data collection: Log all search queries. Aggregate top queries per prefix using a batch job (daily MapReduce) or real-time streaming (Kafka + Flink). (3) Serving layer: The trie is loaded into memory across multiple servers. When a user types "app", lookup the "a" → "p" → "p" node and return its top-K suggestions. Response time must be < 100ms. (4) Optimization: Cache the top prefixes (top 1000 prefixes serve 80% of queries). Filter inappropriate suggestions. Personalize based on user history. (5) Scale: Shard the trie by prefix range (a-m on shard 1, n-z on shard 2). Replicate for availability.
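The top-K-per-node trick can be sketched directly. In this sketch the top-K lists are maintained on every insert; real systems precompute them offline with the batch or streaming jobs described above. Names are illustrative:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_k = []   # (frequency, query) pairs, highest frequency first

class AutocompleteTrie:
    """Each node caches the top-K completions for its prefix, so lookup is O(len(prefix))."""

    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def add(self, query, frequency):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top_k.append((frequency, query))
            node.top_k.sort(reverse=True)   # fine for a sketch; offline jobs do this at scale
            del node.top_k[self.k:]

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [query for _, query in node.top_k]
```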

Q: Design a file storage service like Google Drive. How do you handle large file uploads and syncing across devices?

Architecture: (1) Chunked upload: Split large files into chunks (e.g., 4MB each). Upload chunks in parallel. If a chunk fails, retry only that chunk. Resume interrupted uploads. (2) Storage: Store file chunks in object storage (S3). Store metadata (file name, path, permissions, chunk list) in PostgreSQL. (3) Sync: Use a sync service that maintains a version vector per file. When a file changes on one device, the client sends only the changed chunks (delta sync). The server notifies other devices via long polling or WebSocket. (4) Conflict resolution: If two devices edit the same file simultaneously, keep both versions and let the user choose (like Google Drive's "conflicting copy"). (5) Deduplication: Hash each chunk with SHA256. If the same chunk already exists in storage, don't store it again (content-addressable storage). This saves significant space for similar files. (6) Sharing and permissions: ACL (access control list) per file/folder stored in the database. Support owner, editor, viewer roles.

Q: You need to design a system that processes 1 million events per second. Walk through your architecture.

Event ingestion pipeline: (1) Ingestion: Clients send events to an API gateway with load balancing. Servers validate and immediately publish to Kafka (partitioned by event key for ordering). Kafka handles 1M+ events/second across partitions. (2) Processing: Kafka consumers (worker pools) process events. Use Apache Flink or Kafka Streams for real-time stream processing (aggregations, windowing, filtering). Scale consumers by adding more partitions. (3) Storage: Processed results go to Cassandra (time-series), Elasticsearch (search), or S3 (data lake) depending on the use case. (4) Backpressure: If consumers fall behind, Kafka retains events (configurable retention). No data loss. (5) Monitoring: Track consumer lag (how far behind real-time each consumer is). Alert if lag exceeds threshold. (6) Fault tolerance: Kafka replicates across brokers. Consumers checkpoint their offset. On failure, restart from last checkpoint (at-least-once delivery).

Q: How would you design a global content delivery network (CDN) from scratch?

Core design: (1) Edge servers deployed in 100+ locations worldwide (Points of Presence). Each edge has SSD storage for cached content. (2) DNS-based routing: When a user requests content, DNS resolves to the nearest edge server based on geolocation or latency measurements (anycast routing). (3) Cache hierarchy: Edge cache → Regional cache (mid-tier) → Origin server. If the edge doesn't have the content, it checks the regional cache before going to origin. This reduces origin load. (4) Cache invalidation: TTL-based (content expires after N seconds) + purge API (origin tells CDN to drop specific content after an update). (5) Consistent hashing within a PoP to distribute content across edge servers (avoids duplication within one location). (6) TLS termination at the edge for low-latency HTTPS. (7) DDoS protection by absorbing traffic across distributed edge network. (8) Analytics: Log every request for bandwidth, cache hit ratio, and latency metrics.