Real-World System Designs
Walk through designing 5 real systems: URL Shortener (hashing + key-value store), Chat App (WebSockets + message queue), News Feed (fan-out + caching), Rate Limiter (token bucket), and Notification System (queue + multi-channel delivery).
Explain Like I'm 12
Imagine you're an architect designing buildings. You don't start from scratch every time — you know that a restaurant needs a kitchen, a hospital needs an emergency room, and a school needs classrooms. System design is the same idea but for software.
A URL shortener is like giving a long address a short nickname. A chat app is like passing notes in class, but the teacher (server) makes sure they get to the right person. A news feed is like a bulletin board that magically shows each person the posts they care about most.
1. URL Shortener (like bit.ly)
Requirements
Functional: Given a long URL, generate a short URL. When someone visits the short URL, redirect to the original. Optionally: custom aliases, expiration, analytics (click count).
Non-functional: Low latency redirects (< 50ms). High availability (links should always work). Scale to 100M URLs, 10K redirects/second.
Scale Estimation
- Write: 1 new URL/second (100M URLs over ~3 years)
- Read: 100:1 read-to-write ratio = 100 redirects/second (peaks at 10K/s)
- Storage: 100M URLs * 500 bytes avg = ~50 GB (small)
- Short code length: Base62 (a-z, A-Z, 0-9) with 7 chars = 62^7 = 3.5 trillion combinations (more than enough)
High-Level Architecture
Client → Load Balancer → App Server → Cache (Redis) → Database (DynamoDB or PostgreSQL)
Key Components
- Short code generation: Use a base62-encoded counter (auto-increment ID) or MD5/SHA256 hash of the URL truncated to 7 chars. Counter approach avoids collisions entirely. Hash approach is stateless but needs collision checking.
- Redirect flow: User hits `short.url/abc1234` → app looks up `abc1234` in the Redis cache → on miss, checks the DB → returns a 301 (permanent) or 302 (temporary) redirect to the original URL.
- Cache strategy: Cache-aside with Redis. Most short-URL traffic follows a power-law distribution (roughly 20% of URLs get 80% of traffic), so LRU eviction works well.
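The counter approach can be sketched in a few lines. The alphabet ordering below is an arbitrary choice; a production service would also allocate counter ranges from a coordinator rather than use a local integer:

```python
# Base62 alphabet: which characters map to which digit values is a free choice,
# as long as encode and decode agree.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    """Encode a non-negative counter value (e.g. an auto-increment ID) as base62."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))
```

Because the counter only ever moves forward, two different URLs can never collide; the trade-off, as noted below, is that the counter itself must be coordinated.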
Database Choice
DynamoDB is ideal — it's a key-value store and the access pattern is pure key lookup (short_code → long_url). Low latency, auto-scales, no joins needed. Alternatively, PostgreSQL with an index on short_code works fine at this scale.
Key Trade-offs
- 301 vs 302 redirect: 301 (permanent) is cached by the browser, reducing server load but losing analytics. 302 (temporary) always hits your server, enabling click tracking.
- Hash vs counter: Hash is stateless (no shared counter) but has collision risk. Counter is collision-free but needs a centralized ID generator (single point of failure — mitigate with ranges like Twitter Snowflake).
2. Real-Time Chat (like WhatsApp)
Requirements
Functional: 1:1 messaging, group chat (up to 500 members), online status, read receipts, media sharing (images/video), message history.
Non-functional: Real-time delivery (< 200ms), messages must never be lost, support 50M daily active users, offline message delivery.
Scale Estimation
- Concurrent connections: 50M DAU, ~10M concurrent WebSocket connections
- Messages: 50M users * 40 messages/day = 2B messages/day = ~23K messages/second
- Storage: 2B messages/day * 100 bytes avg = 200 GB/day = 73 TB/year
High-Level Architecture
Client ↔ WebSocket Server ↔ Message Queue (Kafka) ↔ Chat Service → Message Store (Cassandra)
Presence Service tracks online/offline status. Push Notification Service handles offline users.
Key Components
- WebSocket connections: Persistent bidirectional connection between client and server. Each server handles ~50K connections. A connection manager maps user_id → server_id so messages route to the right server.
- Message flow (1:1): User A sends message → WebSocket server publishes to Kafka → Chat service looks up User B's connected server → delivers via WebSocket. If B is offline, push notification is triggered and message is stored for later delivery.
- Group chat fan-out: When someone sends a message to a 200-person group, the chat service fans out to all members. For small groups (< 500), fan-out on write (push to each member). For celebrity accounts with millions of followers, use fan-out on read (see News Feed design).
- Message ordering: Use a message_id with a timestamp component (like Snowflake IDs) so messages within a conversation are always displayed in the correct order.
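A minimal sketch of such an ID generator, assuming the classic Snowflake layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit per-millisecond sequence); the custom epoch here is an arbitrary choice:

```python
import threading
import time

class SnowflakeId:
    """Sketch of a Snowflake-style ID: [41-bit ms timestamp | 10-bit worker | 12-bit seq].

    IDs sort by creation time, so sorting messages by ID gives chronological order.
    """
    def __init__(self, worker_id: int, epoch_ms: int = 1577836800000):  # 2020-01-01
        self.worker_id = worker_id & 0x3FF
        self.epoch_ms = epoch_ms
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond; spin until the next one.
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.epoch_ms) << 22) | (self.worker_id << 12) | self.sequence
```

Within one conversation, sorting by this ID yields a stable, roughly time-ordered display even when messages arrive out of order.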
Database Choice
Cassandra for message storage. The access pattern is "fetch messages for conversation X, sorted by time" — perfect for Cassandra's partition key (conversation_id) + clustering key (timestamp). Handles massive write throughput. Redis for the user-to-server mapping and online status tracking.
Key Trade-offs
- Push vs pull for delivery: Push (WebSocket) for real-time. Pull (HTTP polling) as a fallback for flaky connections. Long polling is a middle ground.
- Message persistence: Store all messages (expensive but users expect history) vs. ephemeral messages (Snapchat model). Most chat apps store everything.
- End-to-end encryption: Encrypted content means the server can't read messages (good for privacy), but it also means no server-side search, no spam detection, and more complex key management.
3. News Feed (like Twitter/Facebook)
Requirements
Functional: Users create posts. Users see a personalized feed of posts from people they follow. Feed is ranked (not just chronological). Support media (images, videos).
Non-functional: Feed loads in < 500ms. Support 500M users, 10M concurrent. Near-real-time updates (new post appears in followers' feeds within seconds).
Scale Estimation
- Posts: 500M users * 10% post daily = 50M new posts/day = ~580 posts/second
- Feed reads: 500M users * 5 feed refreshes/day = 2.5B reads/day = ~29K reads/second
- Average user follows: 200 people. Max (celebrity): 50M followers.
The Core Problem: Fan-Out
When User A posts something, how do you get it into the feeds of all their followers?
| Approach | How it works | Pros | Cons |
|---|---|---|---|
| Fan-out on write (push) | When a user posts, immediately write the post to every follower's feed cache | Feed reads are instant (pre-computed). Simple read path. | Celebrity problem: 50M followers = 50M writes per post. Wasted work for inactive followers. |
| Fan-out on read (pull) | When a user opens their feed, fetch posts from all people they follow and merge | No wasted writes. Works for celebrities. | Slow read path (merge 200+ timelines). High read latency. |
| Hybrid (what Twitter/Facebook use) | Fan-out on write for normal users. Fan-out on read for celebrities (> 10K followers). | Best of both. Fast reads. No celebrity bottleneck. | More complex. Two code paths. |
High-Level Architecture
Post Service receives new posts → Fan-out Service distributes to followers' feed caches → Feed Cache (Redis) stores pre-computed feeds → Feed Service reads from cache on request, merging in celebrity posts on-the-fly.
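The merge step on the read path can be sketched as follows. This hypothetical `read_feed` merges a user's precomputed feed with the timelines of the celebrities they follow, assuming every input is a list of `(timestamp, post_id)` pairs sorted newest-first:

```python
import heapq
from itertools import islice

def read_feed(precomputed, celebrity_timelines, limit=50):
    """Merge the precomputed feed cache with on-the-fly celebrity timelines.

    heapq.merge performs a lazy k-way merge; reverse=True keeps the
    newest-first ordering of the inputs, so we never materialize more
    than `limit` posts.
    """
    streams = [precomputed] + list(celebrity_timelines)
    merged = heapq.merge(*streams, reverse=True)
    return [post_id for _, post_id in islice(merged, limit)]
```

In a real system the "timelines" would be Redis sorted-set range reads rather than in-memory lists, but the merge logic is the same.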
Key Components
- Feed cache: Redis sorted set per user, scored by post timestamp. Keep only the latest 500 posts per feed. When a user scrolls, paginate through the sorted set.
- Ranking: Don't show purely chronological. Rank by engagement (likes, comments), recency, relationship strength (close friends rank higher), and content type. An ML model scores each candidate post.
- Media handling: Upload images/videos to S3, generate thumbnails async, serve via CDN. Store only the media URL in the post record.
Database Choice
PostgreSQL for user accounts, follow relationships. Cassandra for post storage (partition by user_id, sort by timestamp). Redis for pre-computed feed caches. Elasticsearch for post search. S3 + CDN for media.
Key Trade-offs
- Consistency vs freshness: It's acceptable for a post to appear in feeds after a few seconds (eventual consistency). Don't sacrifice read latency for real-time updates.
- Storage vs compute: Fan-out on write uses more storage (copies of posts in every follower's cache) but less compute at read time. The opposite for fan-out on read.
4. Rate Limiter
Requirements
Functional: Limit the number of requests a client can make in a time window (e.g., 100 requests per minute per user). Return HTTP 429 (Too Many Requests) when limit is exceeded. Support different limits for different endpoints.
Non-functional: Low latency (the check should add < 1ms to each request). Distributed (works across multiple API servers). Accurate (no significant over-counting or under-counting).
Where to Place It
- API Gateway: Most common. The gateway (Kong, AWS API Gateway, Nginx) checks rate limits before the request reaches your service. Centralized and easy to manage.
- Middleware: Inside your application, as a middleware layer. More control but duplicated across services.
- Service mesh sidecar: Istio/Envoy handle rate limiting at the network level. Good for microservices.
Algorithms
Token Bucket
A bucket holds tokens that refill at a fixed rate. Each request costs one token. When the bucket is empty, requests are rejected. Allows bursts (up to the bucket size) while enforcing an average rate.
- Parameters: bucket size (max burst), refill rate (tokens per second)
- Example: Bucket size 10, refill rate 2/sec. A user can burst 10 requests instantly, then sustain 2/sec.
- Used by: Amazon, Stripe API
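A minimal in-memory token bucket, assuming a single process (a distributed version would keep the same two values, token count and last-refill time, per user in Redis):

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # max burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Lazily refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With capacity 10 and refill rate 2/sec (the example above), a client can burst 10 requests immediately, then sustain 2 per second.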
Sliding Window Log
Store the timestamp of every request. Count requests in the last N seconds. Most accurate but memory-intensive (stores every timestamp).
Sliding Window Counter
Hybrid: divide time into fixed windows and weight the previous window's count by overlap. Example: if the current window is 30% through, count = (current window count) + (previous window count * 0.7). Good balance of accuracy and memory.
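The weighting above is simple arithmetic; a sketch, using the same numbers as the example (30% through the current window, so the previous window counts at 70%):

```python
def sliding_window_count(current_count: int, previous_count: int,
                         elapsed_fraction: float) -> float:
    """Estimated request count over a sliding window.

    elapsed_fraction: how far through the current fixed window we are (0.0 to 1.0).
    The previous window's count is weighted by the remaining overlap.
    """
    return current_count + previous_count * (1 - elapsed_fraction)

def allowed(current_count: int, previous_count: int,
            elapsed_fraction: float, limit: int) -> bool:
    return sliding_window_count(current_count, previous_count, elapsed_fraction) < limit
```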
Fixed Window Counter
Count requests in fixed time intervals (e.g., per minute starting at :00). Simple but has an edge case: a user can send 2x the limit by bursting at the boundary (end of one window + start of the next).
| Algorithm | Accuracy | Memory | Burst handling |
|---|---|---|---|
| Token bucket | Good | Low (2 values per user) | Allows controlled bursts |
| Sliding window log | Exact | High (all timestamps) | No bursts |
| Sliding window counter | Good (approximate) | Low (2 counters per user) | Smoothed |
| Fixed window | Boundary edge case | Very low (1 counter) | 2x burst at boundaries |
Distributed Rate Limiting
When you have multiple API servers, they all need to share the rate limit state. Use Redis as the central counter store. A naive read-modify-write (GET the count, then SET it) creates a race condition when two servers check the same counter concurrently; instead, each server calls Redis's atomic INCR (paired with EXPIRE to reset the window), so the check-and-increment happens in a single operation. Redis is single-threaded, so operations are naturally serialized.
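A sketch of the fixed-window check built on that atomic increment. Here an in-memory dict stands in for the shared Redis instance, and the window index is baked into the key so each window starts from a fresh counter (mirroring what INCR + EXPIRE achieve):

```python
import time

class FixedWindowLimiter:
    """Fixed-window limiter modelled on Redis INCR + EXPIRE.

    In production, self.counters would be a shared Redis instance and the
    increment would be atomic across all API servers.
    """
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # (user_id, window_index) -> count; stand-in for Redis

    def allow(self, user_id: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        key = (user_id, int(now // self.window))  # window index in the key
        count = self.counters.get(key, 0) + 1     # the "INCR"
        self.counters[key] = count
        return count <= self.limit
```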
Key Trade-offs
- Hard vs soft limits: Hard limit rejects requests immediately. Soft limit allows some overflow and logs it. Use hard limits for security-sensitive endpoints (login), soft limits for general API usage.
- Local vs global rate limiting: Local (per-server) is fast but inaccurate when traffic isn't evenly distributed. Global (Redis) is accurate but adds a network round-trip. Most systems use global with local caching.
5. Notification System
Requirements
Functional: Send notifications via push (iOS/Android), SMS, and email. Users can set preferences (opt-in/out per channel). Priority levels (urgent, normal, low). Templates with dynamic content.
Non-functional: Deliver urgent notifications in < 1 second. Handle 10M notifications/day. Retry failed deliveries. Track delivery status (sent, delivered, read).
Scale Estimation
- Volume: 10M notifications/day = ~115/second average, with peaks at 5-10x (morning, events)
- Channels: 60% push, 30% email, 10% SMS (varies by app)
- Storage: Notification log — 10M/day * 1KB = 10 GB/day (small)
High-Level Architecture
Trigger Service (events from other services) → Notification Service (orchestration, template rendering, user preferences) → Priority Queue (Kafka/SQS) → Channel Workers (Push Worker, Email Worker, SMS Worker) → Third-Party Providers (APNs, FCM, SendGrid, Twilio)
Key Components
- Notification service: Receives events ("user X received a payment"), looks up user preferences (does X want push notifications?), renders the message template, and routes to the appropriate queue(s).
- Priority queue: Urgent notifications (security alerts, OTPs) go to a high-priority queue that's consumed first. Normal notifications (social updates) go to a standard queue. This ensures a burst of marketing emails doesn't delay a password reset OTP.
- Template engine: Store notification templates with placeholders: "Hi {{name}}, your order {{order_id}} has shipped!" The service fills in the variables per notification. Supports localization (language per user).
- Channel workers: Separate worker pools for each channel (push, email, SMS). Each pulls from its queue, calls the third-party provider (APNs for iOS push, FCM for Android, SendGrid for email, Twilio for SMS), and records the delivery status.
- Delivery tracking: Record each notification's lifecycle: created → queued → sent → delivered → read. Use this for analytics and debugging. Store in a time-series database or log store.
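The template step above can be sketched with simple placeholder substitution. The template key, locale-fallback rule, and `{{name}}` syntax here are illustrative assumptions, not a specific template engine:

```python
import re

# Stand-in template store, keyed by (template name, locale).
TEMPLATES = {
    ("order_shipped", "en"): "Hi {{name}}, your order {{order_id}} has shipped!",
    ("order_shipped", "es"): "Hola {{name}}, ¡tu pedido {{order_id}} ha sido enviado!",
}

def render(template_key: str, locale: str, variables: dict) -> str:
    """Fill {{placeholder}} slots in a stored template, falling back to English."""
    template = TEMPLATES.get((template_key, locale)) or TEMPLATES[(template_key, "en")]
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables[m.group(1)]), template)
```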
Database Choice
PostgreSQL for user preferences and notification templates. Kafka or SQS for the message queues (Kafka for high throughput, SQS for simplicity). Cassandra or DynamoDB for the notification log (time-series, write-heavy). Redis for deduplication (prevent sending the same notification twice within a window).
Key Trade-offs
- At-least-once vs exactly-once delivery: Exactly-once is extremely hard in distributed systems. Most notification systems use at-least-once with deduplication. A user might receive a duplicate push notification on rare occasions — that's better than missing one.
- Batching vs immediate: Batch similar notifications ("5 people liked your photo" instead of 5 separate notifications) to avoid notification fatigue. But OTPs and security alerts must be immediate.
- Rate limiting per user: Don't send more than X notifications per hour to a single user. Even if the system triggers 20 events, batch or suppress to prevent spam.
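The deduplication mentioned under Database Choice is typically a set-if-absent with a TTL (in Redis, `SET key value NX EX window`). A sketch with an in-memory dict standing in for Redis:

```python
import time

class Deduper:
    """Suppress duplicate notifications within a window.

    Models Redis SET-with-NX-and-EX: the first sender of a given
    (user, event) key within the window wins; repeats are dropped.
    """
    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.seen = {}  # (user_id, event_key) -> expiry time

    def should_send(self, user_id: str, event_key: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        key = (user_id, event_key)
        expiry = self.seen.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate within the window
        self.seen[key] = now + self.window
        return True
```

Combined with at-least-once queues, this keeps the rare duplicate delivery from reaching the user twice.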
Test Yourself
Q: In a URL shortener, what happens if two users submit the same long URL? Should they get the same short code or different ones?
Q: How does the "celebrity problem" affect news feed design? What's the solution?
Q: Why is the token bucket algorithm preferred over fixed window for rate limiting?
Q: In a chat system, how do you ensure messages are delivered to a user who is currently offline?
Q: How would you handle notification deduplication? (Prevent sending the same alert twice)
Interview Questions
Q: Design a search autocomplete system (like Google search suggestions). What are the key components?
Q: Design a file storage service like Google Drive. How do you handle large file uploads and syncing across devices?
Q: You need to design a system that processes 1 million events per second. Walk through your architecture.
Q: How would you design a global content delivery network (CDN) from scratch?