5 exercises — reconstruct incident timelines, identify root vs proximate cause, recognise cascading failures, write blameless post-mortem timelines, and compose effective incident status updates.
Incident analysis framework
Read logs chronologically — the first anomaly is usually closest to root cause
Root cause — the initiating event (e.g., missing index); proximate cause — immediate trigger (e.g., health check failure)
Blameless principle — timelines describe system events, not people's mistakes
UTC always — post-mortems and status updates must use UTC, not local time
Hypotheses clearly labelled — "likely cause" vs "confirmed cause"
1 / 5
An incident is declared at 03:47 UTC. You are the on-call engineer. Your log search returns several entries between 03:30 and 03:47 UTC. Arrange these events into the correct incident timeline and identify what happened first:
[03:30] INFO order-service: processed 1,240 orders/min (normal baseline)
[03:41] WARN order-service: response time p99=2.1s (threshold: 1.0s)
[03:43] WARN db-primary: connections=485/500 (97% utilised)
[03:44] ERROR order-service: query timeout after 30s on SELECT * FROM orders WHERE status='pending'
[03:47] ERROR order-service: health check failed — database unreachable
What is the correct reconstruction of the incident cause?
Reading an incident timeline from logs:
Always read logs chronologically when reconstructing an incident. The first anomaly in time is typically the root cause or closest to it.
The causal chain in this incident:
1. 03:30 — Normal state: 1,240 orders/min, normal latency
2. 03:41 — First anomaly: p99 latency spikes to 2.1s. Something is slow; the offending query is identified in the next entry.
3. 03:43 — Connection pool at 97%. Why? The slow query (2+ seconds each) is holding connections. With 1,240 req/min, slow queries accumulate and exhaust the pool (a back-of-the-envelope check follows this list).
4. 03:44 — Query timeouts. The pool is exhausted, requests are queuing, and some hit the 30s timeout.
5. 03:47 — Health check fails. The service cannot reach the database at all — connections are completely exhausted or the database is overwhelmed.
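A back-of-the-envelope check with Little's law (connections in use ≈ request rate × time each request holds a connection) shows why the pool saturated; the log values are real, the arithmetic is illustrative:

    1,240 orders/min ≈ 20.7 requests/s
    At the 1.0s threshold:       20.7 × 1.0 ≈ 21 connections in use
    At the 03:41 p99 of 2.1s:    20.7 × 2.1 ≈ 43 connections in use
    Approaching the 30s timeout: 20.7 × 30  ≈ 620 connections needed, well above the 500-connection pool

The 485 connections observed at 03:43 imply typical hold times had already degraded to roughly 23 seconds (485 ÷ 20.7), which is why the timeouts appeared one minute later.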
The likely root cause: SELECT * FROM orders WHERE status='pending' — without an index on status, this is a full-table scan. As the orders table grows, or as the count of "pending" orders grows, this query gets slower. A slow query combined with high request volume means connection pool saturation. Note the deliberate hedge: the logs alone cannot confirm the missing index; confirming it requires the schema or a query plan (see the sketch below).
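During the incident, the fastest way to confirm (or rule out) the missing index is a query-plan check. A minimal sketch, assuming PostgreSQL and the psycopg2 client; the connection string is hypothetical:

    import psycopg2

    conn = psycopg2.connect("dbname=orders host=db-primary")  # hypothetical DSN
    cur = conn.cursor()

    # Confirm the hypothesis: a "Seq Scan on orders" row in the plan output
    # means the status filter is doing a full-table scan.
    cur.execute("EXPLAIN SELECT * FROM orders WHERE status = 'pending'")
    for (line,) in cur.fetchall():
        print(line)

    # The fix, once confirmed. CONCURRENTLY builds the index without blocking
    # writes, but it must run outside a transaction block, hence autocommit.
    conn.autocommit = True
    cur.execute("CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status)")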
Incident timeline vocabulary:
• root cause — the initiating event that triggered the failure chain (here, the slow unindexed query)
• proximate cause — the immediate cause of the observed failure (here, the health check failure)
• cascading failure — one component's failure causes downstream failures
• contributing factor — a condition that made the failure worse (here, the high request volume)
2 / 5
During an incident, you find this sequence in the logs:
[04:12:01] INFO auth-service: token validation OK for user usr_5521
[04:12:01] INFO order-service: received POST /orders from usr_5521
[04:12:02] INFO inventory-service: reserved item SKU-881 qty=1 for order ord_9921
[04:12:02] INFO payment-service: charged $49.99 to card ending 4422 for ord_9921
[04:12:02] ERROR notification-service: failed to send confirmation email to user@example.com — SMTP timeout
[04:12:02] ERROR order-service: order ord_9921 marked as FAILED due to notification error
Which part of the system design is this log sequence revealing as problematic?
This log sequence reveals a classic distributed systems design problem: treating non-critical operations as critical-path dependencies.
What happened (reading the log):
1. Authentication ✅
2. Order received ✅
3. Inventory reserved ✅
4. Payment charged ✅ — money was taken from the user
5. Email notification ❌ — SMTP timeout
6. Order marked FAILED ❌ — because of step 5
The design problem: The order processing flow is synchronous and treats all steps as equally critical. But email notification is NOT critical to the business transaction — the payment and inventory steps are. A failed email should not roll back or fail a completed payment.
The correct design:
• Mark the order as COMPLETED after the payment and inventory steps
• Send notifications asynchronously, via a message queue (see the sketch below)
• If notification fails, retry it — but it should never fail the order
• Consider what the user sees: a paid order showing as "FAILED" is a billing dispute waiting to happen
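A minimal sketch of this design, using an in-process queue as a stand-in for a real message broker (RabbitMQ, SQS, etc.); the service calls are stubs and the names are hypothetical:

    import queue
    import threading
    import time

    notification_queue = queue.Queue()

    def reserve_inventory(order): ...        # stub for the inventory-service call
    def charge_payment(order): ...           # stub for the payment-service call
    def send_confirmation_email(order): ...  # stub for SMTP; may raise on timeout

    def process_order(order):
        reserve_inventory(order)       # critical path: failure should fail the order
        charge_payment(order)          # critical path: failure should fail the order
        order["status"] = "COMPLETED"  # complete once money and stock are settled
        notification_queue.put(order)  # non-critical: enqueue and return immediately

    def notification_worker():
        while True:
            order = notification_queue.get()
            try:
                send_confirmation_email(order)
            except Exception:
                time.sleep(5)                  # back off, then retry; the order
                notification_queue.put(order)  # stays COMPLETED throughout

    threading.Thread(target=notification_worker, daemon=True).start()
    process_order({"id": "ord_9921", "status": "NEW"})

The key property: nothing after the payment step can raise back into process_order, so an SMTP timeout can no longer mark a paid order as FAILED.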
Key vocabulary for this scenario:
• critical path — operations that must succeed for the transaction to complete
• non-critical / fire-and-forget — operations that should not block the happy path (notifications, analytics, audit logs)
• saga pattern — a way to manage distributed transactions with compensating actions
• compensating transaction — undoing a previous step (e.g., refunding if payment succeeds but fulfilment fails)
3 / 5
You are investigating a "users are randomly logged out" complaint. Your log search reveals this pattern recurring every ~6 hours:
A bulk cache-eviction entry (reporting evicted: 14,823, cache_size_before: 15,000, cache_size_after: 177), followed immediately by thousands of these:
{"level":"WARN","msg":"session not found","session_id":"...","action":"redirected to login"}
What is the root cause revealed by these logs?
Reading the evidence:
• evicted: 14,823 out of cache_size_before: 15,000 — 98.8% of all sessions were removed
• cache_size_after: 177 — only 177 sessions remained
• Immediately after: thousands of "session not found" → mass logouts
• Recurring every ~6 hours — NOT random; it is periodic
The "thundering herd" TTL problem: If all sessions are created with the same TTL (e.g., 6-hour expiry) and a large number of users logged in around the same time (e.g., Monday morning login rush), they all expire at the same time. This causes a bulk eviction event that logs out everyone simultaneously.
Solutions to this problem:
1. TTL jitter — add random variation to the TTL: instead of exactly 6 hours, use 5.5–6.5 hours so expiries spread out over time (see the sketch below)
2. Sliding TTL — reset the TTL on each request; active users never expire
3. Persistent sessions — store sessions in a database, not just in cache
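A minimal sketch of solution 1, assuming a Redis-backed session store via redis-py; the key format and constants are illustrative:

    import random
    import redis

    r = redis.Redis(host="localhost")  # hypothetical connection details

    BASE_TTL = 6 * 3600  # the 6-hour expiry from the incident
    JITTER   = 1800      # +/- 30 minutes

    def store_session(session_id: str, payload: bytes) -> None:
        # Spread expiries across a one-hour window so a burst of logins
        # (e.g., Monday morning) cannot expire as one bulk eviction.
        ttl = BASE_TTL + random.randint(-JITTER, JITTER)
        r.setex(f"session:{session_id}", ttl, payload)

Sliding TTL (solution 2) is the same idea applied at read time: call r.expire(key, ttl) on every successful session lookup so active users never hit the deadline.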
Vocabulary for this scenario:
• cache eviction — removing entries from cache (scheduled, capacity-based, or TTL-based)
• TTL (Time To Live) — how long a cache entry remains valid
• thundering herd — many clients simultaneously triggering the same event (expiry, retry, reconnect)
• TTL jitter — randomising expiry times to avoid synchronised eviction
• sliding TTL / rolling TTL — the TTL resets on each access
4 / 5
An incident is being investigated. The team lead asks you to write the "Timeline" section for the post-mortem. Based only on these log entries, which format is correct?
A post-mortem timeline must be:
1. In UTC — never local times; participants in different time zones need a common reference
2. System-oriented, not person-oriented — what the system did, not what engineers did (engineer actions belong in the Response section)
3. Specific, with data — exact values from logs (p99=2.1s), not vague descriptions ("things got slow")
4. Chronological — ordered by time, not by order of discovery
5. Blameless — factual, not judgmental
Why option A is wrong (person-focused): "Alex noticed…" — this puts a person's name and actions in the timeline. In a blameless post-mortem, the timeline traces system events. Engineer actions belong in a separate "Response" section. Person-focused timelines create blame, even when unintentional.
Why option C is wrong (blame language): "someone didn't add a database index" — this is blame, not fact. The correct framing: "the orders.status column lacked an index, resulting in full-table scans under load."
Why option D is wrong (imprecise): "~03:40" and vague descriptions ("pretty bad") are not defensible. Post-mortem timelines must be reconstructable from logs, not from memory.
Post-mortem timeline templates (applied to the exercise-1 incident below):
"[TIME] UTC — [system component]: [event with specific data]"
"[TIME] UTC — Incident declared / acknowledged / mitigated / resolved"
"Duration of impact: X minutes (from T to T+X)"
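Applying the template to the exercise-1 incident (times and values taken from the logs in that exercise):

    03:41 UTC — order-service: p99 latency 2.1s (threshold 1.0s)
    03:43 UTC — db-primary: connection pool at 485/500 (97% utilised)
    03:44 UTC — order-service: 30s query timeouts on SELECT ... WHERE status='pending'
    03:47 UTC — order-service: health check failed, database unreachable. Incident declared.
    Duration of impact: at least 6 minutes (from 03:41, the first user-visible slowdown, to 03:47)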
5 / 5
At 14:22 UTC, a spike of 503 errors begins across the platform. You need to write a Slack status update for the #incidents channel. Based on the evidence you have collected so far (503 errors on order-service, last deployment was at 14:15 UTC, logs show upstream timeout from order-service to db-primary), which update is best?
Incident status update anatomy:
A good incident update during a live incident includes:
1. Timestamp in UTC — when this update was written
2. What is affected — the specific service and the specific user-facing behaviour
3. Impact scope — which users/flows are affected
4. Evidence gathered so far — what the logs show
5. Current hypothesis — the likely cause, labelled as a hypothesis, not stated as fact
6. Actions being taken — what responders are doing right now
7. Next update time — sets expectations for stakeholders
Why option B is best:
• Uses UTC
• Specific: "order-service", "checkout flows"
• Evidence-based: "logs indicate upstream timeout"
• Hypothesis qualified: "is a likely contributing factor and is being investigated" — not stated as fact
• Action stated: "rollback being evaluated"
• Next update promised: "14:45 UTC"
Why option C is dangerous: "The deployment caused the outage" — stated as fact. If rollback doesn't fix it and the real cause is something else, this update misled stakeholders. Use "likely contributing factor" until confirmed.
Incident communication templates (combined into a full example update below):
"We are investigating elevated error rates affecting [service/feature]."
"Logs indicate [evidence]. The [deployment/change] is a likely factor under investigation."
"Mitigation in progress: [action]. ETA: [time or 'unknown']."
"We will provide the next update by [time] UTC."
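One reasonable composition of these templates for this exercise's incident (the facts come from the prompt and the option-B analysis above; the update timestamp is illustrative):

    [14:30 UTC] We are investigating elevated 503 error rates on order-service affecting checkout flows since 14:22 UTC. Logs indicate upstream timeouts from order-service to db-primary. The 14:15 UTC deployment is a likely contributing factor and is being investigated; a rollback is being evaluated. Next update by 14:45 UTC.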