Scalability & Reliability

TL;DR

Scalability means handling more load. Reliability means staying up. Horizontal scaling adds more machines, vertical adds more power. Use replication for availability, sharding for scale, circuit breakers for resilience, and rate limiting for protection.

Explain Like I'm 12

Imagine you run a restaurant. Vertical scaling is making your kitchen bigger — more stoves, more counter space. It works, but eventually you run out of room. Horizontal scaling is opening more restaurants across town. Each one handles its own customers, and together they serve way more people.

Reliability is making sure that if one restaurant catches fire, the others keep running. You have backup plans, spare ingredients, and a manager who can step in if the head chef gets sick. The goal: customers always get served, no matter what goes wrong.

Scalability Overview

Vertical vs horizontal scaling, with replication and sharding

Vertical vs Horizontal Scaling

Every system eventually hits a traffic ceiling. The question is: do you make the box bigger, or add more boxes?

Key insight: Vertical scaling is simpler but has a hard ceiling. Horizontal scaling is limitless in theory but requires your application to be designed for it (stateless services, distributed data).
| Aspect | Vertical Scaling (Scale Up) | Horizontal Scaling (Scale Out) |
|---|---|---|
| What it does | Bigger CPU, more RAM, faster disks | More machines behind a load balancer |
| Cost curve | Exponential (high-end hardware is disproportionately expensive) | Linear (add commodity machines) |
| Limit | Hard ceiling (biggest machine available) | Virtually unlimited |
| Complexity | Low (same code, bigger box) | Higher (load balancing, distributed state, consistency) |
| Downtime | Often requires a restart during upgrades | Zero downtime (add/remove servers live) |
| Single point of failure | Yes (one machine) | No (if designed correctly) |
| Best for | Databases, small-to-medium traffic, quick wins | Web servers, microservices, high-traffic systems |
Real-world approach: Most companies start vertical (it's simpler). They switch to horizontal when they hit the ceiling or need redundancy. You don't have to choose one — many systems use both. Bigger database servers (vertical) plus more app servers (horizontal).

Stateless vs Stateful Services

Horizontal scaling only works well when your services are stateless. Here's why.

Key insight: A stateless service means any server can handle any request. There's no "memory" between requests. This lets you add or remove servers freely because no server is special.

Stateless Services

The server doesn't remember anything between requests. Every request contains all the information needed to process it. Think of it like a cashier who doesn't recognize you — you show your receipt every time.

  • Session data is stored externally (Redis, database, JWT token)
  • Any server can handle any request — the load balancer picks freely
  • Easy to scale: add more servers, remove failed ones, no data loss
  • Example: REST API servers, serverless functions (AWS Lambda)

Stateful Services

The server remembers something about the client between requests. Like a bartender who knows your "usual."

  • Sticky sessions required — the load balancer must route the same client to the same server
  • Harder to scale: if that specific server goes down, the user's state is lost
  • Example: WebSocket connections, in-memory session stores, game servers
The fix: Move state out of the application server. Store sessions in Redis, files in S3, and caches in Memcached. Now your app servers are stateless and you can scale them horizontally without worry.
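The externalized-state idea can also take the form of a signed session token, so no server stores anything at all. Here is a minimal sketch using only the standard library (a simplified JWT-style scheme; `SECRET` and the session fields are illustrative, not a production design):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # illustrative only; load from a secret store in production

def issue_token(session: dict) -> str:
    """Serialize the session and sign it, so any stateless server can verify it."""
    payload = base64.urlsafe_b64encode(json.dumps(session).encode()).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def read_token(token: str):
    """Verify the signature and return the session, or None if it was tampered with."""
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Because the client carries the (signed) state with every request, the load balancer can send each request to any server.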

Replication

Replication means keeping copies of your data on multiple machines. It's your first line of defense against data loss and your main tool for scaling reads.

Key insight: Replication helps with read scalability and availability, but doesn't solve write scalability. For that, you need sharding.

Master-Slave (Primary-Replica)

One master handles all writes. Multiple replicas handle reads. The master pushes changes to replicas.

  • Reads scale: add more replicas to handle more read traffic
  • Writes don't scale: all writes still go to one master
  • Failover: if the master dies, promote a replica to master (manual or automatic)
  • Replication lag: replicas may be slightly behind the master (eventual consistency)
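The read/write split described above can be sketched as a tiny query router (the node names and the keyword-based write detection are illustrative; real database drivers and proxies do this far more robustly):

```python
import itertools

class ReplicatedStore:
    """Route writes to the primary; spread reads across replicas round-robin."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, query: str):
        """Return the node that should execute this query."""
        return self.primary if self._is_write(query) else next(self._replicas)

    @staticmethod
    def _is_write(query: str) -> bool:
        # naive heuristic for the sketch: first SQL keyword decides
        return query.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}
```

Note that a read routed to a replica may return slightly stale data because of replication lag.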

Master-Master (Multi-Primary)

Multiple nodes accept both reads and writes. Changes are replicated between them.

  • Writes scale better: multiple nodes accept writes
  • Conflict risk: two nodes may update the same row simultaneously — needs conflict resolution
  • More complex: harder to set up and debug than master-slave
  • Use case: multi-region deployments where you need low-latency writes in each region

Sync vs Async Replication

| Aspect | Synchronous | Asynchronous |
|---|---|---|
| How it works | Master waits for the replica to confirm before acking the write | Master acks immediately; the replica catches up later |
| Consistency | Strong (replica always up to date) | Eventual (replica may lag behind) |
| Latency | Higher (must wait for the replica) | Lower (no waiting) |
| Risk | Write fails if the replica is down | Data loss if the master dies before the replica catches up |

Sharding (Partitioning)

Sharding splits your data across multiple databases. Each shard holds a subset of the total data. This is how you scale writes — each shard handles its own writes independently.

Sharding is a one-way door. It's extremely hard to undo. It adds significant complexity to your application — cross-shard queries, distributed transactions, rebalancing data. Only shard when you've exhausted vertical scaling and read replicas.

Sharding Strategies

Hash-Based Sharding

Hash the shard key (e.g., user_id) and use modulo to pick a shard. Distributes data evenly.

  • Pro: Even distribution of data
  • Con: Adding/removing shards requires rehashing (use consistent hashing to minimize this)
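A quick simulation shows why plain modulo sharding hurts when the shard count changes (the `user_*` keys are made up; MD5 stands in for any stable hash function):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Hash-based sharding: stable hash of the key, modulo the shard count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"user_{i}" for i in range(10_000)]
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}   # one shard added
moved = sum(before[k] != after[k] for k in keys)
# with modulo hashing, roughly 4 out of 5 keys land on a different shard
```

That mass reshuffle is exactly the problem consistent hashing (below) was designed to avoid.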

Range-Based Sharding

Assign ranges to shards (e.g., users A-M on shard 1, N-Z on shard 2).

  • Pro: Range queries are efficient (all data in one shard)
  • Con: Hot spots if distribution is uneven (e.g., more users with names starting with "S")

Consistent Hashing

Nodes are placed on a virtual ring. Data maps to the next node clockwise. When you add/remove a node, only a fraction of data moves. This is how DynamoDB, Cassandra, and many CDNs work.
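The ring idea can be sketched in a few lines (a toy implementation, not how any particular database does it; the virtual-node count of 100 is an arbitrary choice for balance):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # each physical node occupies `vnodes` positions on the ring
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """Map the key to the next node clockwise on the ring."""
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

Adding a node only relocates the keys that fall between the new node's positions and their predecessors, roughly 1/N of the data.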

Shard key selection is critical. A bad shard key creates hot spots (one shard gets all the traffic). Choose a key with high cardinality and even distribution. user_id is usually good. country is usually bad (most users might be in one country).

Cross-Shard Queries

The biggest pain point. If you need data from multiple shards (e.g., "find all orders across all users"), you must query every shard and merge results. This is slow and complex. Design your shard key so that most queries hit a single shard.
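The fallback path is scatter-gather: fan the query out, then merge. A minimal sketch, with each shard modeled as an in-memory list of rows (a real system would issue these queries in parallel over the network):

```python
def scatter_gather(shards, predicate):
    """Query every shard and merge the results: the slow, last-resort path."""
    results = []
    for shard in shards:                     # sequential here; parallel in practice
        results.extend(row for row in shard if predicate(row))
    return results
```

Every such query costs one round trip per shard, which is why the shard key should make them rare.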

CAP Theorem

The CAP theorem says a distributed system can guarantee at most two of three properties:

The real trade-off: Network partitions (P) will happen in any distributed system. So you're really choosing between Consistency (CP) and Availability (AP) during a partition.
| Property | What it means | Example |
|---|---|---|
| Consistency (C) | Every read returns the most recent write. All nodes see the same data at the same time. | Bank account balance (must be accurate) |
| Availability (A) | Every request gets a response (even if it might be stale). The system never refuses to answer. | Social media "like" count (ok if slightly stale) |
| Partition Tolerance (P) | The system keeps working even when network messages are lost or delayed between nodes. | Any multi-datacenter deployment |

CP vs AP in Practice

| Choose CP (Consistent) | Choose AP (Available) |
|---|---|
| Banking, financial transactions | Social media feeds, like counts |
| Inventory management (prevent overselling) | Product catalog (ok if slightly stale) |
| User authentication (must be accurate) | DNS resolution (cached is fine) |
| Examples: MongoDB (default), Redis Cluster, HBase | Examples: Cassandra, DynamoDB, CouchDB |
Interview tip: Don't say "pick 2 of 3" like it's a menu. Explain that P is non-negotiable in distributed systems, so the real choice is C vs A during a partition. Many systems let you tune this per-query (e.g., Cassandra's consistency levels).

Reliability Patterns

A reliable system keeps working (maybe in a degraded state) even when things go wrong. Here are the patterns that make that happen.

Key principle: Assume everything will fail. Design for failure, not just for the happy path.

Redundancy

Eliminate single points of failure. Every critical component should have at least one backup. Two load balancers (active-passive), multiple app servers, replicated databases, multi-AZ deployment.

Health Checks

The load balancer pings each server periodically. If a server stops responding, traffic is automatically routed away from it. Types: TCP checks (is the port open?), HTTP checks (does /health return 200?), deep checks (can the app reach the database?).

Circuit Breakers

Think of an electrical circuit breaker. If a downstream service starts failing, the circuit "opens" and requests fail fast instead of waiting and timing out. This prevents cascading failures.

  • Closed: normal operation, requests pass through
  • Open: service is down, fail immediately (return cached data or error)
  • Half-open: after a cooldown, let a few requests through to test if the service recovered
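The three states above can be sketched as a small wrapper class (a simplified model with an injectable clock for testability; libraries like resilience4j or pybreaker implement this pattern with more nuance):

```python
import time

class CircuitBreaker:
    """Closed -> Open after repeated failures; Half-open probe after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let this request through as a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip (or re-trip) the breaker
            raise
        self.failures = 0
        self.opened_at = None                   # success closes the circuit
        return result
```

While open, callers get an immediate error (or cached data) instead of piling up waiting on timeouts.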

Retry with Exponential Backoff

When a request fails, retry — but wait longer between each attempt: 1s, 2s, 4s, 8s... Add jitter (random variation) so thousands of clients don't all retry at the exact same moment (thundering herd).

Without backoff and jitter, retries can make things worse. If a service is overloaded and 10,000 clients all retry at the same moment, you've just DDoSed your own service.
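A retry helper with exponential backoff and "full jitter" might look like this (a sketch; the attempt count and base delay are arbitrary defaults, and `sleep` is injectable so tests don't actually wait):

```python
import random
import time

def retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on any exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                           # out of attempts: surface the error
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s...
            sleep(random.uniform(0, delay))      # full jitter spreads clients out
```

Drawing the wait uniformly from [0, delay] is what prevents thousands of clients from retrying in lockstep.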

Rate Limiting

Protect your service by limiting how many requests a client can make in a given time window. Common algorithms:

  • Token bucket: a bucket fills with tokens at a fixed rate. Each request costs one token. When empty, requests are rejected. Allows bursts up to bucket size.
  • Sliding window: count requests in a rolling time window. More accurate than fixed windows. Used by most API gateways.
  • Fixed window: count requests per calendar interval (e.g., per minute). Simple but has edge cases at window boundaries.
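The token bucket is simple enough to sketch directly (an in-process model with an injectable clock; production limiters typically keep this state in Redis so all servers share one bucket per client):

```python
import time

class TokenBucket:
    """Refills at `rate` tokens/sec, allows bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity       # start full
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Spend one token if available; otherwise reject the request."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Lazy refill (computing tokens from elapsed time on each call) avoids needing a background timer.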

Graceful Degradation

When part of the system is overloaded, serve a reduced experience instead of failing completely. Examples: show cached data instead of real-time, disable recommendations but keep search working, serve static pages when the API is down.

SLAs, SLOs, SLIs

How do you measure and promise reliability? Three related but different concepts.

Think of it this way: SLI is what you measure. SLO is what you aim for internally. SLA is what you promise customers (with consequences if you miss it).
| Term | What it is | Example |
|---|---|---|
| SLI (Service Level Indicator) | The actual metric you measure | 99.95% of requests returned in < 200ms |
| SLO (Service Level Objective) | Internal target for the SLI | We aim for 99.9% uptime |
| SLA (Service Level Agreement) | Contract with customers (with penalties) | "99.9% uptime or we credit your bill" |

The "Nines" of Availability

| Availability | Downtime / Year | Downtime / Month | Who targets this |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | Internal tools, batch jobs |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes | Most SaaS products |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | E-commerce, fintech |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | Telecom, critical infrastructure |
Each additional "nine" is 10x harder. Going from 99.9% to 99.99% doesn't just require better tech — it requires better processes, on-call rotations, automated failover, and multi-region deployments. The cost grows exponentially.
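The downtime budgets in the table fall out of a one-line calculation (using a 365.25-day year, which is why the numbers above read 8.77 hours rather than 8.76):

```python
def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime in minutes per year for a given number of nines."""
    availability = 1 - 10 ** (-nines)        # e.g. 3 nines -> 0.999
    return (1 - availability) * 365.25 * 24 * 60

# three nines -> ~525.96 min (~8.77 h); five nines -> ~5.26 min
```

Each extra nine divides the error budget by ten, which is the arithmetic behind "each additional nine is 10x harder."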

Test Yourself

Q: You have a web app handling 1,000 requests/second. Traffic is expected to grow 10x in the next year. Would you scale vertically or horizontally? Why?

Horizontally. At 10,000 req/s you'll likely exceed what a single machine can handle, plus you need redundancy. Add more stateless app servers behind a load balancer. Keep the database vertically scaled (bigger machine) initially, then add read replicas, and only shard if absolutely necessary.

Q: What is the difference between replication and sharding? When do you use each?

Replication copies the same data to multiple nodes (scales reads, provides redundancy). Sharding splits data across multiple nodes (scales writes and storage). Use replication first — it's simpler and solves most read-heavy workloads. Use sharding when a single database can't handle the write volume or data size.

Q: A social media app has both "bank transfer" and "like count" features. How would you apply the CAP theorem differently to each?

Bank transfers need CP (consistency + partition tolerance). You can't show the wrong balance — it's better to be temporarily unavailable than to show incorrect data. Like counts can use AP (availability + partition tolerance). Showing "1,042 likes" when the real number is "1,043" is fine. Users won't notice, and the system stays responsive.

Q: Your service calls a downstream payment API that's been timing out. What reliability patterns would you apply?

Layer multiple patterns: (1) Circuit breaker to stop calling the failing service and fail fast. (2) Retry with exponential backoff + jitter for transient failures. (3) Timeout on each call so you don't hang forever. (4) Graceful degradation — queue the payment for later processing instead of failing the entire user flow. (5) Health checks to detect when the service recovers.

Q: What does "five nines" availability mean, and why is going from 99.9% to 99.99% so much harder?

"Five nines" (99.999%) means only ~5 minutes of downtime per year. Going from 99.9% (~8.7 hours/year) to 99.99% (~52 minutes/year) is 10x harder because you can't rely on humans to fix problems fast enough. You need automated failover, multi-region redundancy, zero-downtime deployments, and extensive chaos testing. Each "nine" requires exponentially more investment in infrastructure and processes.

Interview Questions

Q: How would you design a system to handle a flash sale where traffic spikes 100x for 10 minutes?

Key strategies: (1) Auto-scaling app servers pre-warmed before the sale starts. (2) CDN for all static assets so origin servers aren't hit. (3) Queue-based processing — accept orders into a message queue and process them asynchronously rather than hitting the database directly. (4) Rate limiting per user to prevent abuse. (5) Cache inventory counts in Redis with atomic decrements to avoid overselling. (6) Graceful degradation — disable non-essential features (recommendations, reviews) during the spike. (7) Circuit breakers on downstream services (payment, email) so failures don't cascade.

Q: Explain consistent hashing and why it's preferred over simple modulo-based sharding.

With modulo hashing (hash(key) % N), adding or removing a server changes N, which reassigns almost every key to a different server — massive data movement. Consistent hashing places servers on a virtual ring. Each key maps to the next server clockwise. When you add a server, only the keys between the new server and its predecessor move (~1/N of the total data). This minimizes redistribution. Virtual nodes (multiple positions per server) improve balance. Used by Cassandra, DynamoDB, Memcached, and most CDNs.

Q: You're designing a global e-commerce platform. How do you handle users in the US and Europe with low latency and compliance with GDPR?

Multi-region architecture: (1) CDN (CloudFront/Cloudflare) serves static assets from edge locations worldwide. (2) Regional deployments — US servers in us-east, EU servers in eu-west, with a global load balancer (Route 53/Cloudflare) routing users to the nearest region based on geo-IP. (3) Data sovereignty — EU user data stays in EU databases (GDPR compliance). Use separate database clusters per region. (4) Cross-region replication for shared catalog data (products, prices) that isn't personal. (5) Eventual consistency is acceptable for the product catalog (AP) but user account/payment data is region-local and strongly consistent (CP).

Q: What is the difference between a load balancer and an API gateway? When would you use both?

A load balancer distributes traffic across multiple instances of the same service (Layer 4/7, health checks, round-robin/least-connections). An API gateway is a single entry point for clients that handles cross-cutting concerns: authentication, rate limiting, request routing to different microservices, response aggregation, protocol translation (REST to gRPC). In practice, you use both: the API gateway sits in front as the public-facing entry point, and each microservice behind it has its own load balancer distributing traffic across its instances. Example: Kong/AWS API Gateway (gateway) + ALB (load balancer per service).