Five exercises on describing what you see on monitoring dashboards in professional IT English.
Dashboard narration vocabulary
Spike: a sharp short-duration increase — "a spike in latency at 14:32"
Trend: a directional change over time — "memory usage trended up over 6 hours"
Correlation: two metrics moving together — "correlates with the deployment at 14:15"
Anomaly: a departure from baseline — "outside the normal operating range"
1 / 5
A Grafana panel shows CPU usage jumping from 30% to 92% at 14:32 and returning to 35% by 14:45. How do you describe this in an incident summary?
Dashboard narration — specifics matter
A professional metric description contains:
Starting value: 30% — the baseline
Peak value: 92% — the spike
Timestamp: 14:32 — when it started
Duration: ~13 minutes — quantified, not vague
Recovery: returned to 35% by 14:45
Correlation: batch job at 14:30 — the likely cause
Why vague descriptions fail: In incident reviews, "CPU was high" gives the on-call team nothing to work with. Specifics enable faster diagnosis.
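As a rough illustration, the components listed above can be assembled into a single incident-summary sentence. The sketch below is only one possible wording; the function name describe_spike and its parameters are hypothetical, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def describe_spike(metric, baseline, peak, start, end, recovered_to, unit="%"):
    """Build a one-sentence description of a spike for an incident summary.

    `start` and `end` are "HH:MM" strings on the same day (an assumption
    made to keep the sketch short).
    """
    fmt = "%H:%M"
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes = int(elapsed.total_seconds() // 60)
    return (
        f"{metric} spiked from {baseline}{unit} to {peak}{unit} at {start} "
        f"and returned to {recovered_to}{unit} by {end} "
        f"(duration: ~{minutes} minutes)."
    )


# Values taken from the exercise: 30% -> 92% at 14:32, back to 35% by 14:45.
summary = describe_spike("CPU usage", 30, 92, "14:32", "14:45", 35)
print(summary)
```

The point of the sketch is that every slot in the sentence is a number or a timestamp, so nothing is left to interpretation.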
Monitoring narration phrases:
"[Metric] spiked from [X] to [Y] at [time]."
"The anomaly started [N] minutes before the alert fired."
"This correlates with the deployment/job/traffic spike at [time]."
"The metric returned to baseline at [time] — duration: [N] minutes."
2 / 5
A monitoring dashboard shows memory usage increasing steadily from 4 GB to 7.8 GB over 6 hours, with the OOM killer triggering at hour 7. What type of pattern is this, and what does it suggest?
Memory leak pattern — gradual linear growth
A gradual, monotonically increasing memory trend with no corresponding load increase is a classic memory leak signature:
Memory grows at a steady rate regardless of traffic fluctuations
There is no memory release after requests complete
Eventual OOM (Out of Memory) kill — the OS terminates the process when memory is exhausted
Contrast with normal memory behaviour: Healthy services show memory usage that correlates with load — high during peak traffic, lower during off-peak. A leak grows regardless of traffic.
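The leak signature described above can be checked with simple arithmetic. The following sketch computes the average growth rate from the exercise figures and tests whether a series of readings only ever goes up. The hourly readings are hypothetical values chosen to match 4 GB to 7.8 GB over 6 hours, and 1 GB is treated as 1000 MB for simplicity.

```python
def growth_rate_mb_per_hour(start_gb, end_gb, hours):
    """Average memory growth in MB/hour (assumes 1 GB = 1000 MB)."""
    return (end_gb - start_gb) * 1000 / hours


def is_monotonic_increase(samples):
    """True if every sample is at least as large as the one before it."""
    return all(later >= earlier for earlier, later in zip(samples, samples[1:]))


# Hypothetical hourly readings in GB, consistent with 4 GB -> 7.8 GB over 6 hours.
hourly_gb = [4.0, 4.6, 5.3, 5.9, 6.5, 7.2, 7.8]

rate = growth_rate_mb_per_hour(hourly_gb[0], hourly_gb[-1], hours=6)
print(f"Memory growth rate is approximately {rate:.0f} MB/hour.")
print("Monotonic upward trend:", is_monotonic_increase(hourly_gb))
```

A monotonic series alone does not prove a leak; the comparison against request volume described above is still needed.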
Dashboard vocabulary for this pattern:
"Memory usage shows a monotonic upward trend over [N] hours — indicative of a leak."
"Memory growth rate is approximately [X] MB/hour — steady regardless of request volume."
"The OOM kill at [time] terminated the process after memory reached [X] GB."
"We need to profile the heap to identify the leak source."
3 / 5
An alert fires for a high error rate (threshold: 1%). The graph shows the error rate holding at 0.8% for the past hour, then jumping to 1.4% in the last 5 minutes. A colleague asks: "Is this a real incident?" What is the best professional response?
Alert triage — context before declaration
Professional alert response involves context evaluation before action:
Duration check: has the threshold been breached for 5 minutes or 5 hours? A 5-minute spike calls for a different level of urgency than an hour-long trend.
Correlation search: is there a recent deployment, traffic spike, or scheduled job that explains the anomaly?
Trend direction: is the error rate still climbing, stabilising, or recovering?
Why immediately declaring a P1 can be wrong: Alert fatigue is real. False positives train on-call engineers to dismiss real alerts. Triage first, then escalate based on evidence.
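The duration and trend checks above can be sketched in a few lines. This is an illustration only, not a real alerting API: the function triage and the sample format (minutes ago, error rate in percent) are assumptions, and the readings are hypothetical values matching the exercise.

```python
def triage(samples, threshold):
    """Summarise an alert from (minutes_ago, error_rate_percent) samples.

    Samples are ordered oldest to newest. Returns a tuple of:
    - minutes since the start of the current breach (None if the latest
      sample is within the threshold)
    - trend of the latest reading: "climbing", "stabilising" or "recovering"
    """
    breached_minutes = None
    # Walk back from the newest sample while the threshold is still breached.
    for minutes_ago, rate in reversed(samples):
        if rate <= threshold:
            break
        breached_minutes = minutes_ago

    previous, latest = samples[-2][1], samples[-1][1]
    if latest > previous:
        trend = "climbing"
    elif latest < previous:
        trend = "recovering"
    else:
        trend = "stabilising"
    return breached_minutes, trend


# Hypothetical readings: 0.8% for the past hour, 1.4% in the last 5 minutes.
samples = [(60, 0.8), (30, 0.8), (10, 0.8), (5, 1.4), (0, 1.4)]
minutes, trend = triage(samples, threshold=1.0)
print(f"The metric crossed the threshold {minutes} minutes ago; trend: {trend}.")
```

With these readings the breach is 5 minutes old and no longer climbing, which supports assessing further before declaring an incident.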
Alert triage vocabulary:
"The alert just fired — let me assess before declaring an incident."
"The metric crossed the threshold [N] minutes ago — monitoring for trend direction."
"No correlated deployment in the past 30 minutes — investigating further."
"The spike is recovering — likely a transient event. No incident declaration needed."
4 / 5
After a deployment, the dashboard shows p99 latency increasing from 180ms to 340ms. The p50 latency is unchanged at 45ms. What does this pattern indicate?
p50 vs. p99 divergence — tail latency interpretation
When p99 increases significantly but p50 stays flat:
p50 stable at 45ms: the median request is unaffected
p99 up from 180ms to 340ms: the 99th-percentile latency nearly doubled (about 1.9x), so the slowest 1% of requests now take 340ms or longer
This pattern typically means:
A specific code path, database query, or external call is slow for a minority of requests
Could be: a slow DB query for certain record types, increased GC pause, or a new synchronous external call introduced by the deployment
The issue barely moves the median and is diluted in averages; it is clearly visible only at the tail percentiles
Why this matters: For SLOs, p99 latency is often the binding constraint. A stable median with a degraded tail may still breach the SLO.
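To see how a tail regression can leave the median untouched, here is a minimal sketch using a nearest-rank percentile. The latency samples are hypothetical (1000 requests, of which 20 hit a slow path), and real monitoring systems usually estimate percentiles from histograms rather than raw samples.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 980 fast requests, 20 on a slow code path.
before = [45] * 980 + [180] * 20
after = [45] * 980 + [340] * 20

for label, data in (("before", before), ("after", after)):
    print(f"{label}: p50={percentile(data, 50)}ms, p99={percentile(data, 99)}ms")

ratio = percentile(after, 99) / percentile(before, 99)
print(f"p99 is {ratio:.1f}x the pre-deployment value; p50 is unchanged.")
```

Only 2% of the samples changed, yet p99 moves from 180ms to 340ms while p50 stays at 45ms.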
Tail latency vocabulary:
"We're seeing p99 degradation without a corresponding p50 increase — tail latency issue."
"The median is healthy but the tail is misbehaving."
"We need to identify what's causing the slowest 1% of requests to be 2x slower."
5 / 5
A dashboard shows request rate dropping 40% at 02:00 UTC and recovering at 08:00 UTC. There are no alerts and no incidents. How would you describe this in a daily standup?
Contextual metric interpretation — expected vs. unexpected patterns
A traffic drop during overnight hours for a geography-concentrated user base is typically expected — not a signal to escalate. The professional description:
States the metric change precisely: "40% drop between 02:00 and 08:00 UTC"
Provides the context that explains it: the user base is concentrated in the EU (UTC+1 to UTC+2), so 02:00 UTC is 03:00 to 04:00 local time for users
Confirms no action was needed: "no alert fired — expected behaviour"
Why this matters: Raising a non-incident as a potential incident wastes team time and creates noise. A well-calibrated engineer distinguishes expected patterns from anomalies.
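The UTC-to-local conversion behind that explanation is easy to get wrong in your head, so a quick check can help before the standup. This sketch uses fixed offsets of +1 and +2 hours, as in the exercise; the helper name local_time is hypothetical, and daylight-saving rules are deliberately ignored.

```python
from datetime import datetime, timedelta, timezone


def local_time(utc_hhmm, offset_hours):
    """Convert an "HH:MM" UTC time to local "HH:MM" for a fixed UTC offset."""
    utc_time = datetime.strptime(utc_hhmm, "%H:%M").replace(tzinfo=timezone.utc)
    local = utc_time.astimezone(timezone(timedelta(hours=offset_hours)))
    return local.strftime("%H:%M")


# Start and end of the traffic dip, seen from UTC+1 and UTC+2.
for offset in (1, 2):
    start = local_time("02:00", offset)
    end = local_time("08:00", offset)
    print(f"UTC+{offset}: dip runs from {start} to {end} local time")
```

Both windows fall in the local night and early morning, which is consistent with describing the drop as expected overnight behaviour.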
Normal pattern vocabulary:
"This is within the normal overnight traffic window for our user base."
"The traffic shape follows our typical diurnal pattern."
"No anomaly detected — metric behaviour is consistent with historical baseline."