English for Prometheus Alerting Rules
Learn the English vocabulary for discussing Prometheus alerting rules: expressions, the for clause, alert states, and routing through Alertmanager.
A noisy alerting system erodes trust faster than almost anything else in an on-call rotation, and the vocabulary for discussing Prometheus alerting rules precisely — the expression, the for clause, the alert state — is what lets a team actually fix the noise instead of just muting it.
Key Vocabulary
Alerting rule — a PromQL expression evaluated on a schedule that, when true, generates an alert, defined declaratively alongside metric recording rules rather than as imperative monitoring code.
“That alerting rule fires on every deploy because it’s checking absolute error count instead of error rate — a brief count spike during a rolling restart isn’t actually the failure condition we care about.”
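A rule that checks an error ratio instead of an absolute count might look like the sketch below. The metric name http_requests_total, its status label, the 5% threshold, and the group and alert names are all assumptions made for illustration, not taken from any real service.

```yaml
# Sketch of a Prometheus rule file; all names and thresholds are hypothetical.
groups:
  - name: example-service-alerts
    rules:
      - alert: HighErrorRatio
        # Share of requests that returned a 5xx status over the last 5 minutes.
        # A ratio does not grow just because traffic grows, unlike a raw count.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
            > 0.05
```

In conversation you could describe this as "the rule alerts on the error ratio, not the error count."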
For clause — a duration specified on an alerting rule requiring the condition to remain true continuously for that length of time before the alert actually fires, used to filter out brief, self-resolving blips.
“We’re getting paged for latency spikes that resolve in ten seconds — adding a for: 2m clause to this rule would mean it only fires if the condition holds for two full minutes, which filters out the noise without hiding a real, sustained problem.”
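The for: 2m clause from the example sentence could be written roughly as follows. The histogram metric http_request_duration_seconds_bucket, the 0.5 second threshold, and the alert name are assumptions chosen for illustration.

```yaml
# Sketch of a latency rule with a for clause; names and values are hypothetical.
groups:
  - name: example-latency-alerts
    rules:
      - alert: HighRequestLatency
        # 95th-percentile request latency above 0.5 seconds.
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[1m]))
          ) > 0.5
        # While the expression is true the alert is pending. It moves to
        # firing only after the expression has stayed true for 2 minutes.
        # If the expression turns false first, the pending alert is dropped.
        for: 2m
```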
Pending state — the state an alert enters once its expression becomes true but before its for duration has elapsed, distinct from firing, which only happens once the condition has held long enough.
“Don’t page on pending state — that just means the condition started being true a moment ago. The alert only actually fires, and should only actually page someone, once it clears the for duration.”
Label — a key-value pair attached to an alert, either inherited from the underlying metric or added explicitly in the rule, used by Alertmanager to route, group, and silence alerts based on team, severity, or service.
“This alert went to the wrong team’s channel because the team label wasn’t set on the rule — Alertmanager routes purely off labels, so a missing or wrong one sends the alert to the default route instead of the right one.”
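Label-based routing can be sketched with a small Alertmanager routing tree like the one below. The receiver names and the team value payments are hypothetical, and the matchers syntax shown assumes a reasonably recent Alertmanager version.

```yaml
# Sketch of an Alertmanager routing tree; receiver names are hypothetical.
route:
  # Alerts that match no child route fall through to this receiver.
  receiver: default-channel
  group_by: ['alertname', 'team']
  routes:
    # Alerts carrying the label team="payments" go to that team's on-call.
    - matchers:
        - team="payments"
      receiver: payments-oncall
receivers:
  # Notification settings (chat, paging, email) are omitted in this sketch.
  - name: default-channel
  - name: payments-oncall
```

With a tree like this, a rule that does not carry team: payments under its labels ends up at default-channel, which is the situation the example sentence describes.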
Runbook annotation — an annotation field on an alerting rule, usually a URL, linking directly to the response procedure for that specific alert, included so the person paged doesn’t have to search for context during an incident.
“Every new alerting rule needs a runbook annotation before it ships — an alert that just says ‘high error rate’ with no link to what to actually do about it makes the 3am page a lot worse than it needs to be.”
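As a sketch of what a runbook annotation can look like in practice, here is a hypothetical disk-usage rule. It assumes node_exporter filesystem metrics are being scraped. The 10% threshold, the severity value, and the runbook URL are placeholders.

```yaml
# Sketch of a rule with annotations; threshold and URL are hypothetical.
groups:
  - name: example-disk-alerts
    rules:
      - alert: DiskUsageHigh
        # Less than 10% of the filesystem is still available.
        expr: |
          node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 10m
        labels:
          severity: page
        annotations:
          # Templates are filled in from the labels of the alerting series.
          summary: "Disk almost full on {{ $labels.instance }} ({{ $labels.mountpoint }})"
          # runbook_url is a common naming convention, not a reserved field.
          runbook_url: "https://runbooks.example.com/disk-usage-high"
```

The templated summary tells the on-call engineer which host and which mount point are affected, and the runbook link tells them what to do next.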
Common Phrases
- “Is this alerting rule checking a rate, or an absolute count that could spike for unrelated reasons?”
- “Does this rule have a for clause, or does it fire on the very first evaluation where it’s true?”
- “Is this alert in a pending state, or has it actually fired?”
- “Are the labels set correctly, or is this getting routed to the wrong team?”
- “Does this rule have a runbook annotation, or is whoever gets paged starting from zero?”
Example Sentences
Reviewing a noisy alert in a retro:
“This alert paged four times last week and self-resolved every time within thirty seconds. I’d add a for: 5m clause — the underlying condition being briefly true isn’t actually actionable, and a sustained five minutes is a better signal that something’s actually wrong.”
Explaining alert routing:
“The reason this fired in the platform team’s channel instead of ours is the team label on the rule — it inherited the label from the underlying metric instead of being explicitly set, and the metric’s default doesn’t match our team.”
Pushing for better incident readiness:
“Before we enable this rule in production, it needs a runbook annotation. Right now it just says ‘disk usage high’ — that tells the on-call engineer there’s a problem, but nothing about which disk, which host, or what the actual response steps are.”
Professional Tips
- Base alerting rules on rates or ratios rather than absolute counts wherever possible — absolute thresholds are far more prone to false positives from normal fluctuations like deploys or traffic spikes.
- Add a for clause to any rule prone to brief, self-resolving blips — a rule with no for clause fires on the very first evaluation where the condition is true, even if it clears a second later.
- Understand the distinction between pending and firing states clearly enough to explain it to a new team member — it’s the mechanism the for clause actually relies on, and confusing the two leads to misconfigured or misunderstood rules.
- Set labels deliberately on every alerting rule rather than just inheriting them by default — correct routing depends entirely on labels being accurate, and a missing team or severity label sends the alert to whoever happens to own the default route.
- Require a runbook annotation before any new alerting rule ships to production — an alert without a linked response procedure shifts real diagnostic work onto whoever is paged, at the worst possible time to be doing it.
Practice Exercise
- Explain what a for clause does and why it reduces alert noise.
- Describe the difference between an alert’s pending state and its firing state.
- Write a sentence explaining why labels matter for alert routing in Alertmanager.