Reading Engineering Blog Posts — English for Developers

1 / 5

Passage: Database Sharding — Engineering Blog Post

Title: How We Cut Our p99 Latency by 60% — Lessons From Sharding Our Database

For three years, our single Postgres primary handled everything. It worked beautifully —
until it didn't. By Q2 our write throughput had tripled, and the p99 latency on our
checkout endpoint crept from 120 ms to over 800 ms during peak hours. Vertical scaling
bought us time, but we were running on the largest instance our provider offered. We had
hit a ceiling.

We considered three options:

1. Read replicas. These help read-heavy workloads, but our bottleneck was writes.
Replicas would not have moved the needle.
2. Caching everything. We already cached aggressively; the remaining traffic was
genuinely dynamic.
3. Sharding. Splitting the data across multiple primaries by customer ID.

We chose sharding, fully aware it is the option of last resort. It introduces real
complexity: cross-shard queries become expensive, transactions can no longer span the
whole dataset, and re-sharding later is painful.

The migration took four months. The hardest part was not the code — it was the
backfill. We had to copy two years of historical data into the new shards without
downtime, using a dual-write phase where every write went to both the old and new
systems while we verified consistency.

The result: p99 dropped to 320 ms and write throughput is now effectively unbounded.
Would we do it again? Yes — but only because we genuinely had no alternative. If you
can solve your problem with a read replica or a cache, do that first. Sharding is
powerful, but the operational cost is permanent.

What is the main argument the author makes about sharding?

2 / 5

Passage: Database Sharding — Engineering Blog Post

Title: How We Cut Our p99 Latency by 60% — Lessons From Sharding Our Database

For three years, our single Postgres primary handled everything. It worked beautifully —
until it didn't. By Q2 our write throughput had tripled, and the p99 latency on our
checkout endpoint crept from 120 ms to over 800 ms during peak hours. Vertical scaling
bought us time, but we were running on the largest instance our provider offered. We had
hit a ceiling.

We considered three options:

1. Read replicas. These help read-heavy workloads, but our bottleneck was writes.
Replicas would not have moved the needle.
2. Caching everything. We already cached aggressively; the remaining traffic was
genuinely dynamic.
3. Sharding. Splitting the data across multiple primaries by customer ID.

We chose sharding, fully aware it is the option of last resort. It introduces real
complexity: cross-shard queries become expensive, transactions can no longer span the
whole dataset, and re-sharding later is painful.

The migration took four months. The hardest part was not the code — it was the
backfill. We had to copy two years of historical data into the new shards without
downtime, using a dual-write phase where every write went to both the old and new
systems while we verified consistency.

The result: p99 dropped to 320 ms and write throughput is now effectively unbounded.
Would we do it again? Yes — but only because we genuinely had no alternative. If you
can solve your problem with a read replica or a cache, do that first. Sharding is
powerful, but the operational cost is permanent.

The author calls the backfill "the hardest part." What can you infer about why it was so difficult?

3 / 5

Passage: Database Sharding — Engineering Blog Post

Title: How We Cut Our p99 Latency by 60% — Lessons From Sharding Our Database

For three years, our single Postgres primary handled everything. It worked beautifully —
until it didn't. By Q2 our write throughput had tripled, and the p99 latency on our
checkout endpoint crept from 120 ms to over 800 ms during peak hours. Vertical scaling
bought us time, but we were running on the largest instance our provider offered. We had
hit a ceiling.

We considered three options:

1. Read replicas. These help read-heavy workloads, but our bottleneck was writes.
Replicas would not have moved the needle.
2. Caching everything. We already cached aggressively; the remaining traffic was
genuinely dynamic.
3. Sharding. Splitting the data across multiple primaries by customer ID.

We chose sharding, fully aware it is the option of last resort. It introduces real
complexity: cross-shard queries become expensive, transactions can no longer span the
whole dataset, and re-sharding later is painful.

The migration took four months. The hardest part was not the code — it was the
backfill. We had to copy two years of historical data into the new shards without
downtime, using a dual-write phase where every write went to both the old and new
systems while we verified consistency.

The result: p99 dropped to 320 ms and write throughput is now effectively unbounded.
Would we do it again? Yes — but only because we genuinely had no alternative. If you
can solve your problem with a read replica or a cache, do that first. Sharding is
powerful, but the operational cost is permanent.

In context, what does the author mean by "Vertical scaling bought us time"?

4 / 5

Passage: Monolith to Microservices — Engineering Blog Post

Title: We Migrated From a Monolith to Microservices. Here's What Nobody Told Us.

The promise of microservices is seductive: independent deployments, team autonomy,
and the ability to scale services individually. After eighteen months on the other
side, we can confirm those benefits are real. But they came at a cost we underestimated.

What we expected and got:
  - Teams ship independently. No more coordinated "release trains."
  - We scale the payment service separately from the search service.

What surprised us:
  - Debugging got dramatically harder. A single user request now touches seven
    services. When something breaks, you cannot just read one stack trace — you need
    distributed tracing, and you need it from day one. We added it on day ninety,
    and those first three months were genuinely painful.
  - "Simple" changes now span multiple repositories and require coordinated deploys
    anyway — exactly the coupling we were trying to escape.
  - Local development became a nightmare. Running the full system on a laptop is no
    longer practical.

If we started over, we would not begin with microservices. We would build a
well-structured monolith first, with clear internal module boundaries, and extract
services only when a specific scaling or team-ownership pain forced our hand. The
boundaries you draw on a whiteboard before you understand the domain are almost
always wrong.

What is the author's overall stance on starting a new project with microservices?

5 / 5

Passage: Monolith to Microservices — Engineering Blog Post

Title: We Migrated From a Monolith to Microservices. Here's What Nobody Told Us.

The promise of microservices is seductive: independent deployments, team autonomy,
and the ability to scale services individually. After eighteen months on the other
side, we can confirm those benefits are real. But they came at a cost we underestimated.

What we expected and got:
  - Teams ship independently. No more coordinated "release trains."
  - We scale the payment service separately from the search service.

What surprised us:
  - Debugging got dramatically harder. A single user request now touches seven
    services. When something breaks, you cannot just read one stack trace — you need
    distributed tracing, and you need it from day one. We added it on day ninety,
    and those first three months were genuinely painful.
  - "Simple" changes now span multiple repositories and require coordinated deploys
    anyway — exactly the coupling we were trying to escape.
  - Local development became a nightmare. Running the full system on a laptop is no
    longer practical.

If we started over, we would not begin with microservices. We would build a
well-structured monolith first, with clear internal module boundaries, and extract
services only when a specific scaling or team-ownership pain forced our hand. The
boundaries you draw on a whiteboard before you understand the domain are almost
always wrong.

The author writes that they added distributed tracing "on day ninety." What does the article imply you should do differently?