5 exercises — reconstruct incident timelines, identify root vs proximate cause, recognise cascading failures, write blameless post-mortem timelines, and compose effective incident status updates.
Incident analysis framework
Read logs chronologically — the first anomaly is usually closest to root cause
Root cause — the initiating event (e.g., missing index); proximate cause — immediate trigger (e.g., health check failure)
Blameless principle — timelines describe system events, not people's mistakes
UTC always — post-mortems and status updates must use UTC, not local time
Hypotheses clearly labelled — "likely cause" vs "confirmed cause"
1 / 5
An incident is declared at 03:47 UTC. You are the on-call engineer. Your log search returns several entries between 03:30 and 03:47 UTC. Arrange these events into the correct incident timeline and identify what happened first:
[03:30] INFO order-service: processed 1,240 orders/min (normal baseline)
[03:41] WARN order-service: response time p99=2.1s (threshold: 1.0s)
[03:43] WARN db-primary: connections=485/500 (97% utilised)
[03:44] ERROR order-service: query timeout after 30s on SELECT * FROM orders WHERE status='pending'
[03:47] ERROR order-service: health check failed — database unreachable
What is the correct reconstruction of the incident cause?
Reading an incident timeline from logs:
Always read logs chronologically when reconstructing an incident. The first anomaly in time is typically the root cause or closest to it.
The causal chain in this incident:
1. 03:30 — Normal state: 1,240 orders/min, normal latency
2. 03:41 — First anomaly: p99 latency spikes to 2.1s. Something is slow; the offending query is identified in the next entry.
3. 03:43 — Connection pool at 97%. Why? The slow query (2+ seconds each) is holding connections. With 1,240 req/min, slow queries accumulate and exhaust the pool (a back-of-the-envelope check follows this list).
4. 03:44 — Query timeouts. The pool is exhausted, requests are queuing, and some hit the 30s timeout.
5. 03:47 — Health check fails. The service cannot reach the database at all — connections are completely exhausted or the database is overwhelmed.
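A back-of-the-envelope check with Little's law (connections in use ≈ request rate × time each request holds a connection) shows why the pool saturated; the log values are real, the arithmetic is illustrative:

    1,240 orders/min ≈ 20.7 requests/s
    At the 1.0s threshold:       20.7 × 1.0 ≈ 21 connections in use
    At the 03:41 p99 of 2.1s:    20.7 × 2.1 ≈ 43 connections in use
    Approaching the 30s timeout: 20.7 × 30  ≈ 620 connections needed, well above the 500-connection pool

The 485 connections observed at 03:43 imply typical hold times had already degraded to roughly 23 seconds (485 ÷ 20.7), which is why the timeouts appeared one minute later.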
The likely root cause: SELECT * FROM orders WHERE status='pending' — without an index on status, this is a full-table scan. As the orders table grows, or as the count of "pending" orders grows, this query gets slower. A slow query combined with high request volume means connection pool saturation. Note the deliberate hedge: the logs alone cannot confirm the missing index; confirming it requires the schema or a query plan (see the sketch below).
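During the incident, the fastest way to confirm (or rule out) the missing index is a query-plan check. A minimal sketch, assuming PostgreSQL and the psycopg2 client; the connection string is hypothetical:

    import psycopg2

    conn = psycopg2.connect("dbname=orders host=db-primary")  # hypothetical DSN
    cur = conn.cursor()

    # Confirm the hypothesis: a "Seq Scan on orders" row in the plan output
    # means the status filter is doing a full-table scan.
    cur.execute("EXPLAIN SELECT * FROM orders WHERE status = 'pending'")
    for (line,) in cur.fetchall():
        print(line)

    # The fix, once confirmed. CONCURRENTLY builds the index without blocking
    # writes, but it must run outside a transaction block, hence autocommit.
    conn.autocommit = True
    cur.execute("CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status)")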
Incident timeline vocabulary:
• root cause — the initiating event that triggered the failure chain (here, the slow unindexed query)
• proximate cause — the immediate cause of the observed failure (here, the health check failure)
• cascading failure — one component's failure causes downstream failures
• contributing factor — a condition that made the failure worse (here, the high request volume)
2 / 5
During an incident, you find this sequence in the logs:
[04:12:01] INFO auth-service: token validation OK for user usr_5521
[04:12:01] INFO order-service: received POST /orders from usr_5521
[04:12:02] INFO inventory-service: reserved item SKU-881 qty=1 for order ord_9921
[04:12:02] INFO payment-service: charged $49.99 to card ending 4422 for ord_9921
[04:12:02] ERROR notification-service: failed to send confirmation email to user@example.com — SMTP timeout
[04:12:02] ERROR order-service: order ord_9921 marked as FAILED due to notification error
Which part of the system design is this log sequence revealing as problematic?
This log sequence reveals a classic distributed systems design problem: treating non-critical operations as critical-path dependencies.
What happened (reading the log):
1. Authentication ✅
2. Order received ✅
3. Inventory reserved ✅
4. Payment charged ✅ — money was taken from the user
5. Email notification ❌ — SMTP timeout
6. Order marked FAILED ❌ — because of step 5
The design problem: The order processing flow is synchronous and treats all steps as equally critical. But email notification is NOT critical to the business transaction — the payment and inventory steps are. A failed email should not roll back or fail a completed payment.
The correct design:
• Mark the order as COMPLETED after the payment and inventory steps
• Send notifications asynchronously, via a message queue (see the sketch below)
• If notification fails, retry it — but it should never fail the order
• Consider what the user sees: a paid order showing as "FAILED" is a billing dispute waiting to happen
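A minimal sketch of this design, using an in-process queue as a stand-in for a real message broker (RabbitMQ, SQS, etc.); the service calls are stubs and the names are hypothetical:

    import queue
    import threading
    import time

    notification_queue = queue.Queue()

    def reserve_inventory(order): ...        # stub for the inventory-service call
    def charge_payment(order): ...           # stub for the payment-service call
    def send_confirmation_email(order): ...  # stub for SMTP; may raise on timeout

    def process_order(order):
        reserve_inventory(order)       # critical path: failure should fail the order
        charge_payment(order)          # critical path: failure should fail the order
        order["status"] = "COMPLETED"  # complete once money and stock are settled
        notification_queue.put(order)  # non-critical: enqueue and return immediately

    def notification_worker():
        while True:
            order = notification_queue.get()
            try:
                send_confirmation_email(order)
            except Exception:
                time.sleep(5)                  # back off, then retry; the order
                notification_queue.put(order)  # stays COMPLETED throughout

    threading.Thread(target=notification_worker, daemon=True).start()
    process_order({"id": "ord_9921", "status": "NEW"})

The key property: nothing after the payment step can raise back into process_order, so an SMTP timeout can no longer mark a paid order as FAILED.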
Key vocabulary for this scenario:
• critical path — operations that must succeed for the transaction to complete
• non-critical / fire-and-forget — operations that should not block the happy path (notifications, analytics, audit logs)
• saga pattern — a way to manage distributed transactions with compensating actions
• compensating transaction — undoing a previous step (e.g., refunding if payment succeeds but fulfilment fails)
3 / 5
You are investigating a "users are randomly logged out" complaint. Your log search reveals this pattern recurring every ~6 hours:
A bulk cache-eviction entry (reporting evicted: 14,823, cache_size_before: 15,000, cache_size_after: 177), followed immediately by thousands of these:
{"level":"WARN","msg":"session not found","session_id":"...","action":"redirected to login"}
What is the root cause revealed by these logs?
Reading the evidence:
• evicted: 14,823 out of cache_size_before: 15,000 — 98.8% of all sessions were removed
• cache_size_after: 177 — only 177 sessions remained
• Immediately after: thousands of "session not found" → mass logouts
• Recurring every ~6 hours — NOT random; it is periodic
The "thundering herd" TTL problem: If all sessions are created with the same TTL (e.g., 6-hour expiry) and a large number of users logged in around the same time (e.g., Monday morning login rush), they all expire at the same time. This causes a bulk eviction event that logs out everyone simultaneously.
Solutions to this problem:
1. TTL jitter — add random variation to the TTL: instead of exactly 6 hours, use 5.5–6.5 hours so expiries spread out over time (see the sketch below)
2. Sliding TTL — reset the TTL on each request; active users never expire
3. Persistent sessions — store sessions in a database, not just in cache
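A minimal sketch of solution 1, assuming a Redis-backed session store via redis-py; the key format and constants are illustrative:

    import random
    import redis

    r = redis.Redis(host="localhost")  # hypothetical connection details

    BASE_TTL = 6 * 3600  # the 6-hour expiry from the incident
    JITTER   = 1800      # +/- 30 minutes

    def store_session(session_id: str, payload: bytes) -> None:
        # Spread expiries across a one-hour window so a burst of logins
        # (e.g., Monday morning) cannot expire as one bulk eviction.
        ttl = BASE_TTL + random.randint(-JITTER, JITTER)
        r.setex(f"session:{session_id}", ttl, payload)

Sliding TTL (solution 2) is the same idea applied at read time: call r.expire(key, ttl) on every successful session lookup so active users never hit the deadline.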
Vocabulary for this scenario:
• cache eviction — removing entries from cache (scheduled, capacity-based, or TTL-based)
• TTL (Time To Live) — how long a cache entry remains valid
• thundering herd — many clients simultaneously triggering the same event (expiry, retry, reconnect)
• TTL jitter — randomising expiry times to avoid synchronised eviction
• sliding TTL / rolling TTL — the TTL resets on each access
4 / 5
An incident is being investigated. The team lead asks you to write the "Timeline" section for the post-mortem. Based only on these log entries, which format is correct?
A post-mortem timeline must be:
1. In UTC — never local times; participants in different time zones need a common reference
2. System-oriented, not person-oriented — what the system did, not what engineers did (engineer actions belong in the Response section)
3. Specific, with data — exact values from logs (p99=2.1s), not vague descriptions ("things got slow")
4. Chronological — ordered by time, not by order of discovery
5. Blameless — factual, not judgmental
Why option A is wrong (person-focused): "Alex noticed…" — this puts a person's name and actions in the timeline. In a blameless post-mortem, the timeline traces system events. Engineer actions belong in a separate "Response" section. Person-focused timelines create blame, even when unintentional.
Why option C is wrong (blame language): "someone didn't add a database index" — this is blame, not fact. The correct framing: "the orders.status column lacked an index, resulting in full-table scans under load."
Why option D is wrong (imprecise): "~03:40" and vague descriptions ("pretty bad") are not defensible. Post-mortem timelines must be reconstructable from logs, not from memory.
Post-mortem timeline templates (applied to the exercise-1 incident below):
"[TIME] UTC — [system component]: [event with specific data]"
"[TIME] UTC — Incident declared / acknowledged / mitigated / resolved"
"Duration of impact: X minutes (from T to T+X)"
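Applying the template to the exercise-1 incident (times and values taken from the logs in that exercise):

    03:41 UTC — order-service: p99 latency 2.1s (threshold 1.0s)
    03:43 UTC — db-primary: connection pool at 485/500 (97% utilised)
    03:44 UTC — order-service: 30s query timeouts on SELECT ... WHERE status='pending'
    03:47 UTC — order-service: health check failed, database unreachable. Incident declared.
    Duration of impact: at least 6 minutes (from 03:41, the first user-visible slowdown, to 03:47)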
5 / 5
At 14:22 UTC, a spike of 503 errors begins across the platform. You need to write a Slack status update for the #incidents channel. Based on the evidence you have collected so far (503 errors on order-service, last deployment was at 14:15 UTC, logs show upstream timeout from order-service to db-primary), which update is best?
Incident status update anatomy:
A good incident update during a live incident includes:
1. Timestamp in UTC — when this update was written
2. What is affected — the specific service and the specific user-facing behaviour
3. Impact scope — which users/flows are affected
4. Evidence gathered so far — what the logs show
5. Current hypothesis — the likely cause, labelled as a hypothesis, not stated as fact
6. Actions being taken — what responders are doing right now
7. Next update time — sets expectations for stakeholders
Why option B is best:
• Uses UTC
• Specific: "order-service", "checkout flows"
• Evidence-based: "logs indicate upstream timeout"
• Hypothesis qualified: "is a likely contributing factor and is being investigated" — not stated as fact
• Action stated: "rollback being evaluated"
• Next update promised: "14:45 UTC"
Why option C is dangerous: "The deployment caused the outage" — stated as fact. If rollback doesn't fix it and the real cause is something else, this update misled stakeholders. Use "likely contributing factor" until confirmed.
Incident communication templates (combined into a full example update below):
"We are investigating elevated error rates affecting [service/feature]."
"Logs indicate [evidence]. The [deployment/change] is a likely factor under investigation."
"Mitigation in progress: [action]. ETA: [time or 'unknown']."
"We will provide the next update by [time] UTC."
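One reasonable composition of these templates for this exercise's incident (the facts come from the prompt and the option-B analysis above; the update timestamp is illustrative):

    [14:30 UTC] We are investigating elevated 503 error rates on order-service affecting checkout flows since 14:22 UTC. Logs indicate upstream timeouts from order-service to db-primary. The 14:15 UTC deployment is a likely contributing factor and is being investigated; a rollback is being evaluated. Next update by 14:45 UTC.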