Level: Advanced · 10 terms

Monitoring & Alerting — Prometheus, Grafana, PagerDuty

Vocabulary for monitoring and alerting stacks: Prometheus metric types, PromQL, Alertmanager, Grafana, PagerDuty, and on-call operations.

  • Counter (Prometheus) /ˈkaʊntər/

    A Prometheus metric type whose value only ever increases, apart from resetting to zero when the process restarts. Used for counting events like requests, errors, or bytes sent. Query it through rate() or increase() rather than reading the raw value, which depends on when the process last restarted.

    "http_requests_total is a counter — it only goes up. To get the request rate per second over the last 5 minutes: rate(http_requests_total[5m])."
  • Gauge (Prometheus) /ɡeɪdʒ/

    A Prometheus metric type that can go up or down — representing a current snapshot value like memory usage, queue length, or active connections.

    "memory_usage_bytes is a gauge: it reflects the current heap usage. Unlike a counter, it can decrease when garbage collection frees memory."
  • rate() (PromQL) /reɪt/

    A PromQL function that calculates the per-second average rate of increase of a counter over a specified time range, accounting for counter resets.

    "rate(http_requests_total{status="500"}[5m]) gives the 5-minute average rate of 500 errors per second — use this instead of raw counter values for alert thresholds."
  • Alert fatigue /əˈlɜːt fəˈtiːɡ/

    The desensitisation of on-call engineers to alerts caused by excessive, low-quality, or noisy alerts — leading to ignoring or delaying responses to genuine incidents.

    "We suffered severe alert fatigue: 94% of pages were auto-resolved before the on-call engineer even looked at them. Pruning the alert ruleset from 340 to 40 high-fidelity alerts reduced MTTR by 40% and improved on-call satisfaction."
  • Silence (Alertmanager) /ˈsaɪləns/

    A time-bounded mute in Alertmanager that suppresses notifications for alerts whose labels match a set of matchers. Typically created ahead of planned maintenance so that expected failures do not page anyone.

    "Before the database migration, I created a 4-hour Alertmanager silence matching {job="postgres"} so the on-call team wouldn't be paged during the planned maintenance window."
  • Escalation policy /ˌeskəˈleɪʃən ˈpɒlɪsi/

    A defined sequence of who is notified, and after what delay, when an alert is not acknowledged within a specified time. It ensures incidents reach someone who can respond even if the primary on-call is unavailable.

    "Our escalation policy: page the primary on-call first; if the page is unacknowledged after 5 minutes, page the secondary; after another 5 minutes, page the engineering manager. The chain ensures 24/7 coverage."
  • Acknowledge /əkˈnɒlɪdʒ/

    In on-call tooling (PagerDuty, Opsgenie), the act of claiming an incident to signal that a responder is aware of it and actively investigating, which stops escalation to the next tier.

    "I acknowledged the PagerDuty alert within 2 minutes, which stopped the escalation to the secondary and gave me ownership of the incident investigation."
  • MTTR (Mean Time to Recovery) /miːn taɪm tə rɪˈkʌvəri/

    The average time from when an incident starts until the service is fully restored — a key SRE reliability metric measuring incident response effectiveness.

    "Our MTTR dropped from 47 minutes to 18 minutes after we added runbooks to each alert — responders had a documented remediation path instead of starting from scratch."
  • Recording rule (Prometheus) /rɪˈkɔːdɪŋ ruːl/

    A rule that tells the Prometheus server to evaluate a PromQL expression at a regular interval and store the result as a new time series, so dashboards and alerts can read the pre-aggregated series instead of rerunning an expensive query.

    "Instead of running the expensive cross-service aggregation query on every dashboard load, we defined a recording rule job:http_requests:rate5m that pre-computes and stores the result every 15 seconds."
  • Contact point (Grafana) /ˈkɒntækt pɔɪnt/

    In Grafana Alerting, a configured notification destination — such as Slack, email, PagerDuty, or webhook — where alert notifications are sent when an alert rule fires.

    "We configured two contact points: a Slack channel for warning-level alerts and a PagerDuty integration for critical alerts — the notification policy routes alerts to the appropriate contact point based on severity label."
