Advanced Reading #scalability #sharding #caching #system-design

Reading Scalability Descriptions

Five exercises on reading scalability architecture descriptions, covering horizontal vs vertical scaling, stateless design, database sharding, caching TTL trade-offs, load balancer placement, and auto-scaling policies.

Scalability vocabulary quick reference

Vertical scaling (scale up) — bigger machine; simple but has a ceiling
Horizontal scaling (scale out) — more machines; requires stateless design
Sharding — split data across multiple DB instances; improves writes, complicates cross-shard queries
Cache hit/miss — hit: fast (Redis); miss: slow (DB fetch + cache store)
TTL — balance: long TTL = high hit rate + stale risk; short TTL = fresh data + more misses
Thrashing — rapidly scaling in and out; prevented by asymmetric evaluation windows


1 / 5

Read this scalability description and answer the question:

Scaling the API Tier

The API tier currently runs on a single server with 16 CPU cores and 64 GB of RAM. During peak traffic, the server reaches 90% CPU utilisation. The team is evaluating two options:

Option A — Vertical scaling: Upgrade the current server to 64 CPU cores and 256 GB of RAM. This requires a maintenance window and a full restart. There is a hard upper limit to how powerful a single machine can be.

Option B — Horizontal scaling: Add more servers of the same size (16 cores, 64 GB each) behind a load balancer. No downtime required — new servers are added to the load balancer pool while the existing server continues handling traffic. The application must be stateless for this to work.
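The stateless requirement in Option B can be illustrated with a minimal Python sketch. The class names, the dictionary standing in for a shared store, and the two-server round-robin rotation are all hypothetical; the description above does not say how the platform stores sessions.

```python
from itertools import cycle


class StatefulServer:
    """Keeps session data in its own process memory (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # visible only to this server

    def login(self, user):
        self.sessions[user] = {"logged_in": True}

    def is_logged_in(self, user):
        return user in self.sessions


class StatelessServer:
    """Keeps no local state; reads sessions from a shared store (hypothetical)."""

    def __init__(self, name, shared_store):
        self.name = name
        self.shared_store = shared_store  # stands in for Redis or a database

    def login(self, user):
        self.shared_store[user] = {"logged_in": True}

    def is_logged_in(self, user):
        return user in self.shared_store


def login_then_check(servers, user):
    """Round-robin: the login goes to one server, the follow-up to the next."""
    rotation = cycle(servers)
    next(rotation).login(user)
    return next(rotation).is_logged_in(user)


stateful_pool = [StatefulServer("a"), StatefulServer("b")]
shared = {}
stateless_pool = [StatelessServer("a", shared), StatelessServer("b", shared)]

print(login_then_check(stateful_pool, "alice"))   # False: server b never saw the login
print(login_then_check(stateless_pool, "alice"))  # True: any server can answer
```

With the stateful pool, the second request lands on a server that has no record of the first. With the shared store, it does not matter which server handles which request.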

Why must the application be stateless for horizontal scaling to work correctly?

2 / 5

Read this database sharding description and answer the question:

Sharding the User Database

With 200 million user records, a single PostgreSQL instance is becoming a write bottleneck — all writes go to one machine. The team implements horizontal sharding: the user table is split across 8 database shards. Each shard is a separate PostgreSQL instance holding one eighth of the users.

The sharding key is user_id, and the target shard is computed as user_id mod 8. A user with ID 1,000 is stored on shard 0 (1000 mod 8 = 0). A user with ID 1,001 is on shard 1 (1001 mod 8 = 1).

A shard router layer sits between the application and the databases. The application sends all queries to the router; the router computes the target shard and forwards the query.
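The routing step described above can be sketched in a few lines of Python. This is an illustration only: `ShardRouter`, `put_user` and `get_user` are hypothetical names, and each shard is represented by a plain dictionary rather than a PostgreSQL instance.

```python
NUM_SHARDS = 8  # from the description above


def shard_for(user_id: int) -> int:
    """Return the shard index for a user: user_id mod 8."""
    return user_id % NUM_SHARDS


class ShardRouter:
    """Hypothetical router: forwards each single-user query to exactly one shard."""

    def __init__(self):
        # Each dict stands in for one PostgreSQL instance.
        self.shards = [dict() for _ in range(NUM_SHARDS)]

    def put_user(self, user_id, record):
        self.shards[shard_for(user_id)][user_id] = record

    def get_user(self, user_id):
        return self.shards[shard_for(user_id)].get(user_id)


router = ShardRouter()
router.put_user(1000, {"name": "Ada"})
router.put_user(1001, {"name": "Lin"})
print(shard_for(1000), shard_for(1001))  # 0 1
```

Because every write touches only one shard, the write load is spread across eight machines instead of one.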

Trade-off: Queries that need data from multiple users (e.g. "show all users who signed up in the last 7 days") must now query all 8 shards in parallel and merge the results — called a scatter-gather query.
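A scatter-gather query of the kind described above might look like the following Python sketch. The sample rows, the fixed `TODAY` date and the function names are assumptions made for illustration; lists stand in for the eight database shards, and a thread pool stands in for parallel network calls.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

NUM_SHARDS = 8
TODAY = date(2024, 1, 15)  # fixed date so the example is reproducible (assumption)

# Each list stands in for one shard; rows are (user_id, signup_date).
shards = [[] for _ in range(NUM_SHARDS)]
for user_id, days_ago in [(1000, 2), (1001, 30), (1002, 5), (1011, 1)]:
    shards[user_id % NUM_SHARDS].append((user_id, TODAY - timedelta(days=days_ago)))


def recent_signups_on_shard(rows, since):
    """The per-shard query: runs independently on every shard."""
    return [row for row in rows if row[1] >= since]


def recent_signups(since):
    """Scatter the query to all shards in parallel, then gather and merge."""
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        partials = pool.map(lambda rows: recent_signups_on_shard(rows, since), shards)
        merged = [row for partial in partials for row in partial]
    # The merge step must re-sort: each shard only knows its own ordering.
    return sorted(merged, key=lambda row: row[1], reverse=True)


result = recent_signups(TODAY - timedelta(days=7))
print([user_id for user_id, _ in result])  # [1011, 1000, 1002]
```

The query has to visit all eight shards even though only three rows match, and the result is only as fast as the slowest shard.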

Why does sharding improve write throughput but complicate multi-user queries?

3 / 5

Read this caching layer description and answer the question:

Caching Strategy

The product page is the most visited page on the platform, generating 80% of database reads. Each page load triggers 12 database queries to fetch product details, images, pricing, reviews, and recommendations. At peak load, the database serves 45,000 read queries per second — approaching its limit.

The team introduces a cache-aside cache (Redis) in front of the database. The application checks Redis first. On a cache hit, the response is returned from Redis in ~0.5 ms. On a cache miss, the application queries the database, stores the result in Redis with a TTL (time-to-live) of 60 seconds, and returns the result.
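The read path described above (a pattern often called cache-aside) can be sketched in Python as follows. To keep the example self-contained, `TTLCache` is a hypothetical in-memory stand-in for Redis, the `database` dictionary stands in for the real database, and an injectable clock replaces real waiting. None of these names come from the platform itself.

```python
import time


class TTLCache:
    """In-memory stand-in for Redis with per-key expiry (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


TTL_SECONDS = 60  # from the description above
database = {"product:42": {"price": 19.99}}  # stand-in for the real database
db_reads = 0


def get_product(cache, key):
    """Check the cache first; fall back to the database on a miss."""
    global db_reads
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit
    db_reads += 1
    value = dict(database[key])  # cache miss: query the database
    cache.set(key, value, TTL_SECONDS)
    return value


now = [0.0]  # fake clock, in seconds
cache = TTLCache(clock=lambda: now[0])
get_product(cache, "product:42")          # miss, loads from the database
database["product:42"]["price"] = 24.99   # price changes in the database
now[0] = 30.0
print(get_product(cache, "product:42")["price"])  # 19.99: cached copy, TTL not yet expired
now[0] = 61.0
print(get_product(cache, "product:42")["price"])  # 24.99: reloaded after expiry
```

The two prints show both sides of the TTL setting: the database is queried only twice, but between the price change and expiry the cache serves the old value.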

After deployment, 92% of product page reads are served from cache. Database read queries drop from 45,000 to approximately 3,600 per second.
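The drop in database load follows from the miss rate. The short Python check below applies the 92% hit rate to the full 45,000 reads per second; that is a simplifying assumption, since the description also says product pages account for 80% of database reads and does not say how the remaining reads are served.

```python
peak_reads_per_second = 45_000  # from the description above
hit_rate = 0.92                 # share of reads served from Redis

# Only cache misses reach the database.
db_reads_after_cache = peak_reads_per_second * (1 - hit_rate)
print(round(db_reads_after_cache))  # 3600
```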

What is a potential downside of setting a 60-second TTL on cached product data?

4 / 5

Read this load balancer placement description and answer the question:

Load Balancer Architecture

The platform uses two layers of load balancers:

Layer 1 — Global Load Balancer (Anycast DNS): Routes users to the nearest regional data centre based on geographic location. A user in London is routed to the EU region; a user in New York is routed to the US-East region. This reduces latency by keeping traffic physically closer to the user.

Layer 2 — Regional Load Balancer (Layer 7 / HTTP): Within each region, an application-aware load balancer distributes requests across the API server pool. It uses a least-connections algorithm — routing each new request to the server with the fewest active connections. It also performs health checks: servers that fail three consecutive health checks are removed from the pool automatically.
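The difference between the two selection strategies can be seen in a small Python sketch. The server names and connection counts are a hypothetical snapshot, chosen so that one server is tied up with long-running requests.

```python
from itertools import cycle


def round_robin(servers):
    """Yield servers in fixed rotation, ignoring their current load."""
    return cycle(servers)


def least_connections(active):
    """Pick the server with the fewest active connections."""
    return min(active, key=active.get)


# Hypothetical snapshot: server "a" is stuck on several slow requests.
active = {"a": 9, "b": 2, "c": 3}

rr = round_robin(list(active))
rr_choices = [next(rr) for _ in range(3)]

lc_choices = []
for _ in range(3):
    server = least_connections(active)
    active[server] += 1  # the new request stays open on that server
    lc_choices.append(server)

print(rr_choices)  # ['a', 'b', 'c']
print(lc_choices)  # ['b', 'b', 'c']
```

Round-robin sends the next request to the overloaded server "a" simply because it is next in line, while least-connections routes around it until the load evens out.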

What is the advantage of a "least-connections" algorithm over a simple round-robin approach?

5 / 5

Read this auto-scaling description and answer the question:

Auto-scaling Policy

The API server pool uses horizontal auto-scaling. The scaling policy is defined as:

• Scale out (add servers): when average CPU utilisation across the pool exceeds 70% for 2 consecutive minutes, add 2 servers to the pool.
• Scale in (remove servers): when average CPU utilisation drops below 30% for 5 consecutive minutes, remove 1 server from the pool.
• Minimum instances: 3 (never scale below this, even at zero traffic).
• Maximum instances: 20.

The scale-out condition has a shorter evaluation window (2 minutes) than the scale-in condition (5 minutes). This is intentional: respond quickly to traffic spikes, but avoid thrashing — repeatedly adding and removing servers — during brief traffic dips.
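The policy above can be expressed as a small decision function. This Python sketch assumes one average-CPU sample per minute, ordered oldest first; the function name and the sample values are hypothetical, and real auto-scalers typically add cooldown periods that are not modelled here.

```python
SCALE_OUT_CPU = 70      # percent, from the policy above
SCALE_IN_CPU = 30
SCALE_OUT_WINDOW = 2    # consecutive minutes
SCALE_IN_WINDOW = 5
MIN_INSTANCES = 3
MAX_INSTANCES = 20


def next_instance_count(current, cpu_by_minute):
    """Return the new pool size given per-minute average CPU samples (oldest first)."""
    recent_out = cpu_by_minute[-SCALE_OUT_WINDOW:]
    recent_in = cpu_by_minute[-SCALE_IN_WINDOW:]

    # Scale out: every sample in the short window is above the threshold.
    if len(recent_out) == SCALE_OUT_WINDOW and all(c > SCALE_OUT_CPU for c in recent_out):
        return min(current + 2, MAX_INSTANCES)
    # Scale in: every sample in the longer window is below the threshold.
    if len(recent_in) == SCALE_IN_WINDOW and all(c < SCALE_IN_CPU for c in recent_in):
        return max(current - 1, MIN_INSTANCES)
    return current


print(next_instance_count(5, [50, 75, 80]))           # 7: two minutes above 70%
print(next_instance_count(5, [25, 20, 60, 25, 20]))   # 5: dip interrupted, no scale-in
print(next_instance_count(5, [25, 20, 22, 25, 20]))   # 4: five minutes below 30%
print(next_instance_count(3, [10, 10, 10, 10, 10]))   # 3: already at the minimum
print(next_instance_count(19, [90, 95]))              # 20: capped at the maximum
```

The second call shows the effect of the longer scale-in window: a single busy minute inside the five-minute window is enough to keep the pool at its current size.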

Why is the scale-in evaluation window (5 minutes) longer than the scale-out window (2 minutes)?