Interpret real-world monitoring data and communicate about it in professional English. Five exercises covering latency percentiles, throughput, throttles, spikes, and incident description.
⚠ Alarm: "HighP99Latency" — TRIGGERED at 14:03 UTC
How would you describe this situation in a team Slack message?
What makes option C correct:
• States the alarm name and trigger time
• Explains what p99 means in plain language ("slowest 1%")
• Notes what is NOT affected (volume, CPU), which helps narrow down the cause
• Proposes a hypothesis (slow code path, DB/external call)
• Avoids broadcasting raw numbers without interpretation
Key vocabulary:
• p50 / p95 / p99: percentile latencies; p99 is the time within which 99% of requests complete
• alarm threshold: a configured value that triggers a notification when exceeded
• baseline: the normal, expected measurement under typical conditions
• CPU utilisation: the percentage of compute capacity currently in use
• error rate: the proportion of requests that result in an error
Useful phrases:
• "We have a latency spike on the p99; investigating now."
• "The alarm fired at 14:03 UTC. p50 looks healthy, so this is likely affecting a narrow segment of requests."
• "CPU is under 40%, so this isn't a resource saturation issue."
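To make the percentile vocabulary concrete, here is a minimal sketch of how p50, p95, and p99 could be computed from raw latency samples. It uses the nearest-rank method, which is only one of several percentile definitions, and the sample latencies are invented for illustration; they are not taken from the alarm in this exercise.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [120] * 98 + [4000, 4200]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50}ms p95={p95}ms p99={p99}ms")
```

With this made-up data, p50 and p95 stay at the fast value while p99 lands on a slow request, which is the "healthy median, slow tail" shape that the useful phrases describe.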
2 / 5
A Grafana chart shows a sharp spike at 03:14 UTC:
Before 03:14: CPU ~22%, Memory ~4.1 GB, DB connections: 8
At 03:14: CPU → 94%, Memory → 7.9 GB, DB connections → 247
At 03:22: CPU → 23%, Memory → 4.3 GB, DB connections → 11
Which description of this data is most accurate for an incident report?
Option A is correct because it:
• Names all three affected metrics explicitly
• Gives exact start and end times from the chart
• Calculates the duration (03:14–03:22 = 8 minutes)
• Uses professional language ("spiked", "returned to normal")
• Avoids vague language ("maximum", "issues")
Reading this chart:
• The spike pattern (sharp rise, sustained peak, sharp recovery) is typical of a batch job, scheduled task, or traffic burst
• DB connections jumping from 8 to 247 suggests a connection pool exhaustion event, likely caused by a query without a timeout or a loop that opened connections without releasing them
• Memory roughly doubling alongside DB connections suggests large result sets being held in memory
Key vocabulary:
• spike: a sudden, brief increase in a metric
• sustained peak: a high value that stays elevated for a period
• connection pool: a set of pre-opened DB connections reused across requests
• connection pool exhaustion: all connections in the pool are in use; new requests queue or fail
• throughput: the amount of work done per unit of time
Incident report phrases:
• "The incident window was 03:14–03:22 UTC (8 minutes)."
• "All three key metrics (CPU, memory, and DB connections) returned to baseline by 03:22."
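The incident window and the size of the spike can be checked with a few lines of arithmetic. The sketch below uses the readings quoted from the chart; the variable names and the `peak_ratio` helper are illustrative choices, and the chart gives only times, so no date is assumed.

```python
from datetime import datetime

# Readings quoted from the chart (all times UTC).
before = {"cpu_pct": 22, "memory_gb": 4.1, "db_connections": 8}
peak = {"cpu_pct": 94, "memory_gb": 7.9, "db_connections": 247}

# Only the time of day is known, so parse times without a date.
start = datetime.strptime("03:14", "%H:%M")
end = datetime.strptime("03:22", "%H:%M")
duration_min = int((end - start).total_seconds() // 60)

def peak_ratio(metric):
    """How many times higher the peak reading was than the pre-spike reading."""
    return peak[metric] / before[metric]

print(f"Incident window: 03:14-03:22 UTC ({duration_min} minutes)")
for name in before:
    print(f"{name}: {peak_ratio(name):.1f}x the pre-spike value")
```

Expressing each metric as a ratio makes it easy to see that DB connections moved far more than CPU or memory, which supports the connection pool reading of the chart.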
3 / 5
A DevOps engineer describes a Grafana alert to the team:
"We had a throughput of 50,000 events per second at peak with a p95 processing latency of 380ms. After the 14:00 deploy, throughput dropped to 31,000 events/sec and p95 climbed to 2,100ms."
What does this description tell you about the impact of the 14:00 deploy?
Option A is correct because it:
• Quantifies the throughput drop as a percentage: (50,000 − 31,000) ÷ 50,000 = 38%
• Quantifies the latency increase as a percentage: (2,100 − 380) ÷ 380 ≈ 453%
• Calls it a "performance regression", the correct technical term
• Proposes a hypothesis ("new code slowed down processing")
Why option D is insufficient: Raw differences (19K events/sec, 1,720ms) communicate less clearly than relative changes. A 19K drop means very different things at different baselines.
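The relative changes in this exercise come from one formula, (after − before) ÷ before. A small sketch, with `pct_change` as an illustrative helper name rather than anything from a monitoring library:

```python
def pct_change(before, after):
    """Relative change from before to after, as a percentage.
    Negative means a decrease, positive means an increase."""
    return (after - before) / before * 100

# Figures quoted by the DevOps engineer.
throughput_change = pct_change(50_000, 31_000)  # events per second
latency_change = pct_change(380, 2_100)         # p95 in milliseconds
latency_multiple = 2_100 / 380                  # "N times slower" form

print(f"Throughput: {throughput_change:.0f}%")
print(f"p95 latency: +{latency_change:.0f}% ({latency_multiple:.1f}x)")
```

Note that a 453% increase and a 5.5× multiple describe the same change: the multiple is the percentage increase divided by 100, plus 1.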
Key vocabulary:
• throughput: events/requests/messages processed per unit of time
• performance regression: a deploy or change that makes the system slower or less efficient
• events per second (EPS): common unit for stream processing, log ingestion, message queues
• p95 latency: 95% of requests complete within this time; 5% take longer
• deploy (also: release, rollout, push): shipping new code to production
Useful phrases for this situation:
• "The deploy introduced a performance regression."
• "We need to roll back: throughput dropped 38% and p95 latency is now about 5.5× higher."
• "Let's compare the flamegraphs before and after the 14:00 deploy to identify the bottleneck."
At a Monday standup, how would you describe what happened?
Reading this CloudWatch output:
ConcurrentExecutions: 493/500
Lambda has a concurrency limit: the maximum number of instances running simultaneously in your AWS account/region. At 493/500, we're nearly at the cap.
Throttles: 847
When the limit is hit, Lambda throttles additional invocations: it returns a 429 (TooManyRequestsException) or queues them. 847 throttles means 847 invocations were rejected or delayed.
Maximum duration: 12,400ms vs Average: 234ms
An average of 234ms with a max of 12.4s (53× higher) suggests some invocations waited in queue behind throttled requests. They weren't actually slow to execute; they were slow to start.
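The three readings above can be turned into standup-ready numbers with a little arithmetic. This is a sketch using only the figures quoted in the exercise; the variable names and the wording of the summary are illustrative.

```python
# Figures quoted from the CloudWatch output.
concurrent_executions = 493
concurrency_limit = 500
throttles = 847
avg_duration_ms = 234
max_duration_ms = 12_400

utilisation_pct = concurrent_executions / concurrency_limit * 100
headroom = concurrency_limit - concurrent_executions
max_to_avg = max_duration_ms / avg_duration_ms

def standup_summary():
    """One-sentence-per-fact summary suitable for a spoken update."""
    return (
        f"Lambda peaked at {concurrent_executions}/{concurrency_limit} "
        f"concurrent executions ({utilisation_pct:.1f}% of the limit), "
        f"leaving only {headroom} free slots. "
        f"{throttles} invocations were throttled, and the slowest "
        f"invocation took {max_to_avg:.0f}x the average duration."
    )

print(standup_summary())
```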
Key vocabulary:
• throttle: to intentionally limit the rate or volume of requests; a throttle event is when a request is rejected due to limits
• concurrency limit: maximum simultaneous executions (AWS Lambda, API Gateway, etc.)
• cold start: a Lambda invocation that requires spinning up a new execution environment (adds latency)
• invocation: a single execution of a Lambda function
• queue depth: the number of pending requests waiting to be processed
Actions to suggest:
• "Request an account concurrency limit increase, or raise the reserved concurrency for this function."
• "Add an SQS queue in front to absorb bursts without dropping requests."
• "Profile the function to see if we can reduce duration and free slots faster."
5 / 5
A week-long Grafana chart for an API service shows this pattern:
Option C is the strongest analysis because it:
• Identifies the structural pattern (weekday/weekend cycle), showing understanding of the full week, not just the spike
• Measures the spike relative to weekday baseline (4.6×), not the weekend baseline
• Converts all changes to multipliers for clear communication
• Connects the symptoms to known causes (cold cache, connection pool, pre-warming)
Calculating the multipliers:
• Traffic: 14.8 ÷ 3.2 = 4.6× the weekday baseline
• Latency: 890 ÷ 145 = 6.1× the weekday baseline
• Error rate: 1.4% ÷ 0.01% = 140× the weekday baseline
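The multipliers can be reproduced by dividing each spike value by its weekday baseline. The sketch below uses the pairs of numbers quoted in the calculation; units are left out because both values in each pair share the same unit and it cancels in the ratio.

```python
def multiplier(spike, baseline):
    """How many times larger the spike value is than the baseline."""
    return spike / baseline

# (spike value, weekday baseline) pairs from the calculation above.
readings = {
    "traffic": (14.8, 3.2),
    "latency": (890, 145),
    "error_rate": (1.4, 0.01),
}

multipliers = {name: multiplier(s, b) for name, (s, b) in readings.items()}

for name, value in multipliers.items():
    print(f"{name}: {value:.1f}x the weekday baseline")
```

Comparing against the weekday baseline rather than the quiet weekend level keeps the multipliers from overstating the spike.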
Key vocabulary:
• traffic cycle / diurnal pattern: regular variation in traffic by time of day or day of week
• cold cache: a cache that was cleared or expired over the weekend; first requests must hit the database
• pre-warming: sending artificial traffic before peak hours to populate caches and initialise connection pools
• Monday morning effect / thundering herd: a surge of traffic when users return after a weekend
• connection pool exhaustion: all available DB connections consumed simultaneously
Recommended action: "Set up a pre-warming cron job to run at 07:45 UTC on weekdays to populate the cache and initialise connection pools before the 08:00 traffic peak."
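One possible shape for such a pre-warming job is sketched below. Everything specific here is an assumption for illustration: the crontab entry assumes the host clock runs in UTC, the script path and the endpoint paths are hypothetical, and `fetch` is injected so the logic can be tried without a real service. In production, `fetch` could wrap `urllib.request.urlopen(url, timeout=5).status` from the standard library.

```python
# Hypothetical crontab entry (07:45 Monday to Friday, host clock in UTC):
#   45 7 * * 1-5  /usr/bin/python3 /opt/prewarm.py

def prewarm(paths, fetch):
    """Call fetch(path) for each path and record whether it returned a
    2xx status. Failures are recorded rather than raised, so one bad
    endpoint does not stop the rest of the warm-up."""
    results = {}
    for path in paths:
        try:
            status = fetch(path)
            results[path] = 200 <= status < 300
        except Exception:
            results[path] = False
    return results

# Stand-in for a real HTTP call, used here only to demonstrate the flow.
def fake_fetch(path):
    if path == "/broken":
        raise TimeoutError("simulated timeout")
    return 200

# Hypothetical endpoints whose responses would populate the cache.
warmup_results = prewarm(["/products", "/search?q=popular", "/broken"], fake_fetch)
print(warmup_results)
```

Recording failures instead of raising means the job can log which endpoints did not warm up, which is useful context if the 08:00 peak still shows elevated latency.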