5 exercises — Prometheus metric types, PromQL rate(), AlertManager silences, on-call acknowledge vs resolve, and alert fatigue — essential monitoring vocabulary for SREs and DevOps engineers.
1 / 5
A Prometheus dashboard shows two metrics: http_requests_total and memory_used_bytes. Which is a counter and which is a gauge?
http_requests_total is a counter (the _total suffix is the naming convention for counters) and memory_used_bytes is a gauge. Prometheus has four metric types. A counter is a cumulative value that only increases, apart from resetting to zero when the process restarts: request counts, error counts, bytes sent. A gauge is a value that can go up or down: memory usage, active connections, queue depth, temperature. A histogram counts observations into configurable buckets (e.g. request latency). A summary calculates configurable quantiles client-side. A counter's raw value is rarely useful on its own, so counters are almost always queried through rate() or increase().
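To make the counter/gauge distinction concrete, here is a minimal sketch in plain Python. The Counter and Gauge classes below are toy stand-ins written for this illustration, not the API of any Prometheus client library; they only model the rule that a counter may not decrease while a gauge may move in either direction.

```python
class Counter:
    """Toy counter: cumulative, may only go up (a restart would reset it to 0)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Toy gauge: a point-in-time value that may rise or fall."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


http_requests_total = Counter()
memory_used_bytes = Gauge()

for _ in range(3):              # three requests served
    http_requests_total.inc()

memory_used_bytes.set(512e6)
memory_used_bytes.set(256e6)    # going down is normal for a gauge
```

Calling http_requests_total.inc(-1) raises ValueError here, which mirrors why "how many requests are in flight right now" belongs in a gauge rather than a counter.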
2 / 5
In PromQL, what does rate(http_requests_total[5m]) calculate?
rate(counter[range]) calculates the average per-second rate of increase of a counter over the given time window, and it handles counter resets (process restarts) automatically. rate(http_requests_total[5m]) gives you requests per second averaged over the last 5 minutes. Use irate() for an instantaneous rate based on the last two data points in the window; it is more responsive but spiky. Use increase(counter[range]) when you want the total increase over the window rather than a per-second rate. Never use rate() on a gauge; use delta() or deriv() instead.
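The reset handling can be sketched in a few lines of Python. This is a simplified model, not Prometheus's implementation: it divides the reset-corrected increase by the time between the first and last sample, whereas the real rate() also extrapolates towards the edges of the window. The names increase, simple_rate and samples are chosen for this example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def increase(samples: List[Sample]) -> float:
    """Total counter increase across the samples, treating any drop as a reset."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter restarted from zero, so the whole current
            # value counts as new increase.
            total += curr
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase(samples) / elapsed


# One sample per minute for 5 minutes; the process restarts before t=180.
samples = [(0, 100), (60, 160), (120, 220), (180, 30), (240, 90), (300, 150)]
per_second = simple_rate(samples)  # 270 / 300 s = 0.9 requests per second
```

Without the reset branch, the drop from 220 to 30 would be counted as a negative increase and the rate would be badly understated.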
3 / 5
"The on-call rotation was getting paged for a known maintenance window, so we created a _____ in AlertManager to suppress those alerts for 2 hours without changing any routing rules."
A silence in AlertManager is a time-bounded suppression rule: you specify a set of label matchers and a duration, and any alerts matching those labels during that window are swallowed, so they never reach the receiver (Slack, PagerDuty, email). Silences are created via the AlertManager UI or API and expire automatically. Compare with an inhibition rule (a standing rule in the configuration that suppresses child alerts while a parent alert is firing, e.g. suppress host-level alerts when the whole datacenter is down) and a dead man's switch (a "watchdog" alert that fires continuously by design; an external system raises the alarm when that heartbeat stops arriving, which means the monitoring pipeline itself has stopped working).
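The matching logic of a silence can be modelled as follows. This is an illustrative sketch, not AlertManager code: the Silence class and its suppresses method are hypothetical names, only equality matchers are modelled (AlertManager also supports regex and negative matchers), and the label values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict


@dataclass
class Silence:
    matchers: Dict[str, str]  # label name -> required value (equality only)
    starts_at: datetime
    ends_at: datetime

    def suppresses(self, alert_labels: Dict[str, str], now: datetime) -> bool:
        """True if the silence is active and every matcher matches the alert."""
        active = self.starts_at <= now < self.ends_at
        matches = all(alert_labels.get(k) == v for k, v in self.matchers.items())
        return active and matches


# A 2-hour maintenance window for one (hypothetical) cluster.
start = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
silence = Silence({"cluster": "db-prod"}, start, start + timedelta(hours=2))

alert = {"alertname": "HighLatency", "cluster": "db-prod", "severity": "page"}
other = {"alertname": "HighLatency", "cluster": "web-prod", "severity": "page"}

during = silence.suppresses(alert, start + timedelta(minutes=30))    # True
after = silence.suppresses(alert, start + timedelta(hours=3))        # False: expired
unrelated = silence.suppresses(other, start + timedelta(minutes=30)) # False: label mismatch
```

Because the silence carries its own end time, nothing has to be cleaned up afterwards, and routing rules stay untouched.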
4 / 5
During an incident, PagerDuty shows an alert as "acknowledged." What does that mean — and how is it different from "resolved"?
In on-call tools (PagerDuty, OpsGenie, etc.): Acknowledge tells the system "I have seen this, stop escalating to the next person in the rotation." The incident is still open; you are working on it. Resolve closes the incident, indicating that the problem has been dealt with and the service is healthy again. If an acknowledged alert is not resolved within a configured timeout, it may return to the triggered state and escalate again. Assign (or reassign) transfers ownership to another responder. Good incident hygiene means acknowledging promptly when paged (within the agreed response SLA), then resolving only when the issue is genuinely fixed, not merely when the alert stops firing.
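The lifecycle described above is a small state machine, sketched here in Python. The class, method names and the 30-minute timeout are assumptions made for illustration; this is not the API or default behaviour of PagerDuty or OpsGenie.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


class Incident:
    def __init__(self, ack_timeout_s: float) -> None:
        self.state = State.TRIGGERED
        self.ack_timeout_s = ack_timeout_s
        self.acked_at: Optional[float] = None

    def acknowledge(self, now: float) -> None:
        """Responder has seen the page: pause escalation, keep the incident open."""
        if self.state is State.TRIGGERED:
            self.state = State.ACKNOWLEDGED
            self.acked_at = now

    def resolve(self) -> None:
        """Close the incident."""
        self.state = State.RESOLVED

    def tick(self, now: float) -> None:
        """Re-trigger if the acknowledgement has outlived its timeout."""
        if (self.state is State.ACKNOWLEDGED
                and self.acked_at is not None
                and now - self.acked_at >= self.ack_timeout_s):
            self.state = State.TRIGGERED
            self.acked_at = None

    def should_escalate(self) -> bool:
        return self.state is State.TRIGGERED


incident = Incident(ack_timeout_s=1800)  # hypothetical 30-minute ack timeout
incident.acknowledge(now=0)
paused = not incident.should_escalate()  # acknowledged: escalation is paused
incident.tick(now=1800)                  # nobody resolved it in time
re_escalated = incident.should_escalate()
incident.resolve()
```

Note that only resolve() ends the incident; acknowledging merely buys time.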
5 / 5
"The team is suffering from alert fatigue." What is alert fatigue, and why is it dangerous?
Alert fatigue occurs when the volume or noise of alerts is so high that on-call engineers become desensitised: they batch-acknowledge pages without investigating, miss real incidents in the noise, or develop anxiety that degrades judgement. Common causes: alerting on internal causes (e.g. high CPU) rather than on symptoms that users actually feel, thresholds set too sensitively, and too many "warning" alerts that never require action. Remedies: ruthlessly delete low-signal alerts or relax their thresholds, alert on SLO burn rate rather than raw metrics, use multi-window burn rate alerts (fast + slow windows), and hold regular alert reviews. High alert fatigue is a leading indicator that a future major outage will be missed.
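A multi-window burn rate check can be sketched as below. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); a value of 1 means the budget would be exactly used up over the SLO period. The function names are chosen for this example, and the 14.4 threshold is an illustrative value: it corresponds to spending 2% of a 30-day budget within one hour (0.02 × 720 hours).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float, threshold: float) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)


SLO = 0.999        # 99.9% of requests should succeed
THRESHOLD = 14.4   # illustrative: 2% of a 30-day budget burned in 1 hour

# Sustained problem: 2% errors over the long window, 3% over the short one.
ongoing = should_page(0.02, 0.03, SLO, THRESHOLD)

# Already recovering: the long window is still high, the short one is back to normal.
recovering = should_page(0.02, 0.001, SLO, THRESHOLD)

# Minor blip: 0.5% errors is only about 5x burn, below the paging threshold.
minor = should_page(0.005, 0.005, SLO, THRESHOLD)
```

Requiring both windows is what cuts the noise: the long window filters out brief blips, and the short window stops the page from firing after the problem has already gone away.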