Scalability & Reliability

TL;DR

Scalability means handling more load. Reliability means staying up. Horizontal scaling adds more machines, vertical adds more power. Use replication for availability, sharding for scale, circuit breakers for resilience, and rate limiting for protection.

Explain Like I'm 12

Imagine you run a restaurant. Vertical scaling is making your kitchen bigger — more stoves, more counter space. It works, but eventually you run out of room. Horizontal scaling is opening more restaurants across town. Each one handles its own customers, and together they serve way more people.

Reliability is making sure that if one restaurant catches fire, the others keep running. You have backup plans, spare ingredients, and a manager who can step in if the head chef gets sick. The goal: customers always get served, no matter what goes wrong.

Scalability Overview

Vertical vs horizontal scaling, with replication and sharding

Vertical vs Horizontal Scaling

Every system eventually hits a traffic ceiling. The question is: do you make the box bigger, or add more boxes?

Key insight: Vertical scaling is simpler but has a hard ceiling. Horizontal scaling is limitless in theory but requires your application to be designed for it (stateless services, distributed data).
| Aspect | Vertical Scaling (Scale Up) | Horizontal Scaling (Scale Out) |
|---|---|---|
| What it does | Bigger CPU, more RAM, faster disks | More machines behind a load balancer |
| Cost curve | Exponential (high-end hardware is disproportionately expensive) | Linear (add commodity machines) |
| Limit | Hard ceiling (biggest machine available) | Virtually unlimited |
| Complexity | Low (same code, bigger box) | Higher (load balancing, distributed state, consistency) |
| Downtime | Often requires a restart during upgrades | Zero downtime (add/remove servers live) |
| Single point of failure | Yes (one machine) | No (if designed correctly) |
| Best for | Databases, small-to-medium traffic, quick wins | Web servers, microservices, high-traffic systems |
Real-world approach: Most companies start vertical (it's simpler). They switch to horizontal when they hit the ceiling or need redundancy. You don't have to choose one — many systems use both. Bigger database servers (vertical) plus more app servers (horizontal).

Stateless vs Stateful Services

Horizontal scaling only works well when your services are stateless. Here's why.

Key insight: A stateless service means any server can handle any request. There's no "memory" between requests. This lets you add or remove servers freely because no server is special.

Stateless Services

The server doesn't remember anything between requests. Every request contains all the information needed to process it. Think of it like a cashier who doesn't recognize you — you show your receipt every time.

  • Session data is stored externally (Redis, database, JWT token)
  • Any server can handle any request — the load balancer picks freely
  • Easy to scale: add more servers, remove failed ones, no data loss
  • Example: REST API servers, serverless functions (AWS Lambda)

Stateful Services

The server remembers something about the client between requests. Like a bartender who knows your "usual."

  • Sticky sessions required — the load balancer must route the same client to the same server
  • Harder to scale: if that specific server goes down, the user's state is lost
  • Example: WebSocket connections, in-memory session stores, game servers
The fix: Move state out of the application server. Store sessions in Redis, files in S3, and caches in Memcached. Now your app servers are stateless and you can scale them horizontally without worry.
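The externalized-state idea can also take the form of a signed session token, so no server stores anything at all. Here is a minimal sketch using only the standard library (a simplified JWT-style scheme; `SECRET` and the session fields are illustrative, not a production design):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # illustrative only; load from a secret store in production

def issue_token(session: dict) -> str:
    """Serialize the session and sign it, so any stateless server can verify it."""
    payload = base64.urlsafe_b64encode(json.dumps(session).encode()).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def read_token(token: str):
    """Verify the signature and return the session, or None if it was tampered with."""
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Because the client carries the (signed) state with every request, the load balancer can send each request to any server.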

Replication

Replication means keeping copies of your data on multiple machines. It's your first line of defense against data loss and your main tool for scaling reads.

Key insight: Replication helps with read scalability and availability, but doesn't solve write scalability. For that, you need sharding.

Master-Slave (Primary-Replica)

One master handles all writes. Multiple replicas handle reads. The master pushes changes to replicas.

  • Reads scale: add more replicas to handle more read traffic
  • Writes don't scale: all writes still go to one master
  • Failover: if the master dies, promote a replica to master (manual or automatic)
  • Replication lag: replicas may be slightly behind the master (eventual consistency)
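The read/write split described above can be sketched as a tiny query router (the node names and the keyword-based write detection are illustrative; real database drivers and proxies do this far more robustly):

```python
import itertools

class ReplicatedStore:
    """Route writes to the primary; spread reads across replicas round-robin."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, query: str):
        """Return the node that should execute this query."""
        return self.primary if self._is_write(query) else next(self._replicas)

    @staticmethod
    def _is_write(query: str) -> bool:
        # naive heuristic for the sketch: first SQL keyword decides
        return query.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}
```

Note that a read routed to a replica may return slightly stale data because of replication lag.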

Master-Master (Multi-Primary)

Multiple nodes accept both reads and writes. Changes are replicated between them.

  • Writes scale better: multiple nodes accept writes
  • Conflict risk: two nodes may update the same row simultaneously — needs conflict resolution
  • More complex: harder to set up and debug than master-slave
  • Use case: multi-region deployments where you need low-latency writes in each region

Sync vs Async Replication

| Aspect | Synchronous | Asynchronous |
|---|---|---|
| How it works | Master waits for the replica to confirm before acking the write | Master acks immediately; the replica catches up later |
| Consistency | Strong (replica always up to date) | Eventual (replica may lag behind) |
| Latency | Higher (must wait for the replica) | Lower (no waiting) |
| Risk | Write fails if the replica is down | Data loss if the master dies before the replica catches up |

Sharding (Partitioning)

Sharding splits your data across multiple databases. Each shard holds a subset of the total data. This is how you scale writes — each shard handles its own writes independently.

Sharding is a one-way door. It's extremely hard to undo. It adds significant complexity to your application — cross-shard queries, distributed transactions, rebalancing data. Only shard when you've exhausted vertical scaling and read replicas.

Sharding Strategies

Hash-Based Sharding

Hash the shard key (e.g., user_id) and use modulo to pick a shard. Distributes data evenly.

  • Pro: Even distribution of data
  • Con: Adding/removing shards requires rehashing (use consistent hashing to minimize this)
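A quick simulation shows why plain modulo sharding hurts when the shard count changes (the `user_*` keys are made up; MD5 stands in for any stable hash function):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Hash-based sharding: stable hash of the key, modulo the shard count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"user_{i}" for i in range(10_000)]
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}   # one shard added
moved = sum(before[k] != after[k] for k in keys)
# with modulo hashing, roughly 4 out of 5 keys land on a different shard
```

That mass reshuffle is exactly the problem consistent hashing (below) was designed to avoid.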

Range-Based Sharding

Assign ranges to shards (e.g., users A-M on shard 1, N-Z on shard 2).

  • Pro: Range queries are efficient (all data in one shard)
  • Con: Hot spots if distribution is uneven (e.g., more users with names starting with "S")

Consistent Hashing

Nodes are placed on a virtual ring. Data maps to the next node clockwise. When you add/remove a node, only a fraction of data moves. This is how DynamoDB, Cassandra, and many CDNs work.
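The ring idea can be sketched in a few lines (a toy implementation, not how any particular database does it; the virtual-node count of 100 is an arbitrary choice for balance):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # each physical node occupies `vnodes` positions on the ring
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """Map the key to the next node clockwise on the ring."""
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

Adding a node only relocates the keys that fall between the new node's positions and their predecessors, roughly 1/N of the data.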

Shard key selection is critical. A bad shard key creates hot spots (one shard gets all the traffic). Choose a key with high cardinality and even distribution. user_id is usually good. country is usually bad (most users might be in one country).

Cross-Shard Queries

The biggest pain point. If you need data from multiple shards (e.g., "find all orders across all users"), you must query every shard and merge results. This is slow and complex. Design your shard key so that most queries hit a single shard.
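The fallback path is scatter-gather: fan the query out, then merge. A minimal sketch, with each shard modeled as an in-memory list of rows (a real system would issue these queries in parallel over the network):

```python
def scatter_gather(shards, predicate):
    """Query every shard and merge the results: the slow, last-resort path."""
    results = []
    for shard in shards:                     # sequential here; parallel in practice
        results.extend(row for row in shard if predicate(row))
    return results
```

Every such query costs one round trip per shard, which is why the shard key should make them rare.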

CAP Theorem

The CAP theorem says a distributed system can guarantee at most two of three properties:

The real trade-off: Network partitions (P) will happen in any distributed system. So you're really choosing between Consistency (CP) and Availability (AP) during a partition.
| Property | What it means | Example |
|---|---|---|
| Consistency (C) | Every read returns the most recent write. All nodes see the same data at the same time. | Bank account balance (must be accurate) |
| Availability (A) | Every request gets a response (even if it might be stale). The system never refuses to answer. | Social media "like" count (ok if slightly stale) |
| Partition Tolerance (P) | The system keeps working even when network messages are lost or delayed between nodes. | Any multi-datacenter deployment |

CP vs AP in Practice

| Choose CP (Consistent) | Choose AP (Available) |
|---|---|
| Banking, financial transactions | Social media feeds, like counts |
| Inventory management (prevent overselling) | Product catalog (ok if slightly stale) |
| User authentication (must be accurate) | DNS resolution (cached is fine) |
| Examples: MongoDB (default), Redis Cluster, HBase | Examples: Cassandra, DynamoDB, CouchDB |
Interview tip: Don't say "pick 2 of 3" like it's a menu. Explain that P is non-negotiable in distributed systems, so the real choice is C vs A during a partition. Many systems let you tune this per-query (e.g., Cassandra's consistency levels).

Reliability Patterns

A reliable system keeps working (maybe in a degraded state) even when things go wrong. Here are the patterns that make that happen.

Key principle: Assume everything will fail. Design for failure, not just for the happy path.

Redundancy

Eliminate single points of failure. Every critical component should have at least one backup. Two load balancers (active-passive), multiple app servers, replicated databases, multi-AZ deployment.

Health Checks

The load balancer pings each server periodically. If a server stops responding, traffic is automatically routed away from it. Types: TCP checks (is the port open?), HTTP checks (does /health return 200?), deep checks (can the app reach the database?).

Circuit Breakers

Think of an electrical circuit breaker. If a downstream service starts failing, the circuit "opens" and requests fail fast instead of waiting and timing out. This prevents cascading failures.

  • Closed: normal operation, requests pass through
  • Open: service is down, fail immediately (return cached data or error)
  • Half-open: after a cooldown, let a few requests through to test if the service recovered
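The three states above can be sketched as a small wrapper class (a simplified model with an injectable clock for testability; libraries like resilience4j or pybreaker implement this pattern with more nuance):

```python
import time

class CircuitBreaker:
    """Closed -> Open after repeated failures; Half-open probe after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let this request through as a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip (or re-trip) the breaker
            raise
        self.failures = 0
        self.opened_at = None                   # success closes the circuit
        return result
```

While open, callers get an immediate error (or cached data) instead of piling up waiting on timeouts.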

Retry with Exponential Backoff

When a request fails, retry — but wait longer between each attempt: 1s, 2s, 4s, 8s... Add jitter (random variation) so thousands of clients don't all retry at the exact same moment (thundering herd).

Without backoff and jitter, retries can make things worse. If a service is overloaded and 10,000 clients all retry at the same moment, you've just DDoSed your own service.
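A retry helper with exponential backoff and "full jitter" might look like this (a sketch; the attempt count and base delay are arbitrary defaults, and `sleep` is injectable so tests don't actually wait):

```python
import random
import time

def retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on any exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                           # out of attempts: surface the error
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s...
            sleep(random.uniform(0, delay))      # full jitter spreads clients out
```

Drawing the wait uniformly from [0, delay] is what prevents thousands of clients from retrying in lockstep.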

Rate Limiting

Protect your service by limiting how many requests a client can make in a given time window. Common algorithms:

  • Token bucket: a bucket fills with tokens at a fixed rate. Each request costs one token. When empty, requests are rejected. Allows bursts up to bucket size.
  • Sliding window: count requests in a rolling time window. More accurate than fixed windows. Used by most API gateways.
  • Fixed window: count requests per calendar interval (e.g., per minute). Simple but has edge cases at window boundaries.
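The token bucket is simple enough to sketch directly (an in-process model with an injectable clock; production limiters typically keep this state in Redis so all servers share one bucket per client):

```python
import time

class TokenBucket:
    """Refills at `rate` tokens/sec, allows bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity       # start full
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Spend one token if available; otherwise reject the request."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Lazy refill (computing tokens from elapsed time on each call) avoids needing a background timer.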

Graceful Degradation

When part of the system is overloaded, serve a reduced experience instead of failing completely. Examples: show cached data instead of real-time, disable recommendations but keep search working, serve static pages when the API is down.

SLAs, SLOs, SLIs

How do you measure and promise reliability? Three related but different concepts.

Think of it this way: SLI is what you measure. SLO is what you aim for internally. SLA is what you promise customers (with consequences if you miss it).
| Term | What it is | Example |
|---|---|---|
| SLI (Service Level Indicator) | The actual metric you measure | 99.95% of requests returned in < 200ms |
| SLO (Service Level Objective) | Internal target for the SLI | We aim for 99.9% uptime |
| SLA (Service Level Agreement) | Contract with customers (with penalties) | "99.9% uptime or we credit your bill" |

The "Nines" of Availability

| Availability | Downtime / Year | Downtime / Month | Who targets this |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | Internal tools, batch jobs |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes | Most SaaS products |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | E-commerce, fintech |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | Telecom, critical infrastructure |
Each additional "nine" is 10x harder. Going from 99.9% to 99.99% doesn't just require better tech — it requires better processes, on-call rotations, automated failover, and multi-region deployments. The cost grows exponentially.
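The downtime budgets in the table fall out of a one-line calculation (using a 365.25-day year, which is why the numbers above read 8.77 hours rather than 8.76):

```python
def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime in minutes per year for a given number of nines."""
    availability = 1 - 10 ** (-nines)        # e.g. 3 nines -> 0.999
    return (1 - availability) * 365.25 * 24 * 60

# three nines -> ~525.96 min (~8.77 h); five nines -> ~5.26 min
```

Each extra nine divides the error budget by ten, which is the arithmetic behind "each additional nine is 10x harder."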

Test Yourself

Q: You have a web app handling 1,000 requests/second. Traffic is expected to grow 10x in the next year. Would you scale vertically or horizontally? Why?

Horizontally. At 10,000 req/s you'll likely exceed what a single machine can handle, plus you need redundancy. Add more stateless app servers behind a load balancer. Keep the database vertically scaled (bigger machine) initially, then add read replicas, and only shard if absolutely necessary.

Q: What is the difference between replication and sharding? When do you use each?

Replication copies the same data to multiple nodes (scales reads, provides redundancy). Sharding splits data across multiple nodes (scales writes and storage). Use replication first — it's simpler and solves most read-heavy workloads. Use sharding when a single database can't handle the write volume or data size.

Q: A social media app has both "bank transfer" and "like count" features. How would you apply the CAP theorem differently to each?

Bank transfers need CP (consistency + partition tolerance). You can't show the wrong balance — it's better to be temporarily unavailable than to show incorrect data. Like counts can use AP (availability + partition tolerance). Showing "1,042 likes" when the real number is "1,043" is fine. Users won't notice, and the system stays responsive.

Q: Your service calls a downstream payment API that's been timing out. What reliability patterns would you apply?

Layer multiple patterns: (1) Circuit breaker to stop calling the failing service and fail fast. (2) Retry with exponential backoff + jitter for transient failures. (3) Timeout on each call so you don't hang forever. (4) Graceful degradation — queue the payment for later processing instead of failing the entire user flow. (5) Health checks to detect when the service recovers.

Q: What does "five nines" availability mean, and why is going from 99.9% to 99.99% so much harder?

"Five nines" (99.999%) means only ~5 minutes of downtime per year. Going from 99.9% (~8.7 hours/year) to 99.99% (~52 minutes/year) is 10x harder because you can't rely on humans to fix problems fast enough. You need automated failover, multi-region redundancy, zero-downtime deployments, and extensive chaos testing. Each "nine" requires exponentially more investment in infrastructure and processes.

Interview Questions

Q: How would you design a system to handle a flash sale where traffic spikes 100x for 10 minutes?

Key strategies: (1) Auto-scaling app servers pre-warmed before the sale starts. (2) CDN for all static assets so origin servers aren't hit. (3) Queue-based processing — accept orders into a message queue and process them asynchronously rather than hitting the database directly. (4) Rate limiting per user to prevent abuse. (5) Cache inventory counts in Redis with atomic decrements to avoid overselling. (6) Graceful degradation — disable non-essential features (recommendations, reviews) during the spike. (7) Circuit breakers on downstream services (payment, email) so failures don't cascade.

Q: Explain consistent hashing and why it's preferred over simple modulo-based sharding.

With modulo hashing (hash(key) % N), adding or removing a server changes N, which reassigns almost every key to a different server — massive data movement. Consistent hashing places servers on a virtual ring. Each key maps to the next server clockwise. When you add a server, only the keys between the new server and its predecessor move (~1/N of the total data). This minimizes redistribution. Virtual nodes (multiple positions per server) improve balance. Used by Cassandra, DynamoDB, Memcached, and most CDNs.

Q: You're designing a global e-commerce platform. How do you handle users in the US and Europe with low latency and compliance with GDPR?

Multi-region architecture: (1) CDN (CloudFront/Cloudflare) serves static assets from edge locations worldwide. (2) Regional deployments — US servers in us-east, EU servers in eu-west, with a global load balancer (Route 53/Cloudflare) routing users to the nearest region based on geo-IP. (3) Data sovereignty — EU user data stays in EU databases (GDPR compliance). Use separate database clusters per region. (4) Cross-region replication for shared catalog data (products, prices) that isn't personal. (5) Eventual consistency is acceptable for the product catalog (AP) but user account/payment data is region-local and strongly consistent (CP).

Q: What is the difference between a load balancer and an API gateway? When would you use both?

A load balancer distributes traffic across multiple instances of the same service (Layer 4/7, health checks, round-robin/least-connections). An API gateway is a single entry point for clients that handles cross-cutting concerns: authentication, rate limiting, request routing to different microservices, response aggregation, protocol translation (REST to gRPC). In practice, you use both: the API gateway sits in front as the public-facing entry point, and each microservice behind it has its own load balancer distributing traffic across its instances. Example: Kong/AWS API Gateway (gateway) + ALB (load balancer per service).