Advanced Log Reading #incident-response #post-mortem #timeline #root-cause

Incident Log Analysis

Five exercises: reconstruct incident timelines, distinguish root cause from proximate cause, recognise cascading failures, write blameless post-mortem timelines, and compose effective incident status updates.

Incident analysis framework
  • Read logs chronologically: the first anomaly is usually the one closest to the root cause
  • Root cause: the initiating event (e.g., a missing index). Proximate cause: the immediate trigger of the visible failure (e.g., a health check failure)
  • Blameless principle: timelines describe system events, not people's mistakes
  • UTC always: post-mortems and status updates must use UTC, not local time
  • Label hypotheses clearly: distinguish "likely cause" from "confirmed cause"
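
The first and fourth points of the framework can be sketched in a few lines of Python. This is a minimal illustration, not a definitive tool. It assumes log lines in a "[HH:MM] LEVEL service: message" layout, assumes the timestamps are already in UTC, and treats any WARN or ERROR entry as an anomaly. The sample lines, the date, and the names parse_line, build_timeline and first_anomaly are all hypothetical.

```python
import re
from datetime import date, datetime, timezone

# Hypothetical sample lines, deliberately out of order, as a log search
# might return them. They are illustrative only, not real incident data.
RAW_LINES = [
    "[10:07] ERROR api-gateway: upstream timeout after 30s",
    "[10:00] INFO api-gateway: 200 requests/min (normal baseline)",
    "[10:05] WARN cache: hit rate 41% (threshold: 80%)",
    "[10:09] ERROR api-gateway: health check failed",
]

# Assumed layout: "[HH:MM] LEVEL service: message"
LINE_RE = re.compile(r"\[(\d{2}):(\d{2})\] (\w+) ([\w-]+): (.*)")
ANOMALY_LEVELS = {"WARN", "ERROR"}  # assumption: anything above INFO is an anomaly
DAY = date(2024, 1, 1)  # hypothetical incident date


def parse_line(line, day):
    """Parse one log line into a dict; the timestamp is assumed to be UTC."""
    match = LINE_RE.fullmatch(line)
    if match is None:
        raise ValueError(f"unrecognised log line: {line!r}")
    hour, minute, level, service, message = match.groups()
    ts = datetime(day.year, day.month, day.day,
                  int(hour), int(minute), tzinfo=timezone.utc)
    return {"ts": ts, "level": level, "service": service, "message": message}


def build_timeline(lines, day):
    """Return the parsed entries sorted chronologically."""
    return sorted((parse_line(line, day) for line in lines),
                  key=lambda entry: entry["ts"])


def first_anomaly(timeline):
    """Return the earliest WARN/ERROR entry, or None if there is none."""
    return next((e for e in timeline if e["level"] in ANOMALY_LEVELS), None)


timeline = build_timeline(RAW_LINES, DAY)
for entry in timeline:
    print(f"{entry['ts']:%H:%M} UTC  {entry['level']:<5} "
          f"{entry['service']}: {entry['message']}")

start = first_anomaly(timeline)
if start is not None:
    # The first anomaly is a starting point for investigation, i.e. a
    # "likely cause" until it is confirmed by further evidence.
    print(f"First anomaly (likely cause, unconfirmed): "
          f"{start['ts']:%H:%M} UTC {start['service']}")
```

In this sample the earliest anomaly is the cache WARN at 10:05, which precedes both gateway errors. The sketch only orders events and flags where to start looking; deciding whether that entry is the root cause or a symptom of something earlier still requires reading the surrounding evidence.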
1 / 5
An incident is declared at 03:47 UTC. You are the on-call engineer. Your log search returns several entries between 03:30 and 03:47 UTC. Arrange these events into the correct incident timeline and identify what happened first:

[03:30] INFO order-service: processed 1,240 orders/min (normal baseline)
[03:41] WARN order-service: response time p99=2.1s (threshold: 1.0s)
[03:43] WARN db-primary: connections=485/500 (97% utilised)
[03:44] ERROR order-service: query timeout after 30s on SELECT * FROM orders WHERE status='pending'
[03:47] ERROR order-service: health check failed — database unreachable

What is the correct reconstruction of the incident cause?