Prometheus & PromQL Vocabulary: Monitoring Terms for DevOps Engineers
Prometheus scraping, metric types, PromQL queries, Alertmanager, and observability vocabulary for SREs.
If you work in DevOps or Site Reliability Engineering, you will encounter Prometheus almost everywhere. It has become the de facto standard for metrics-based monitoring in cloud-native environments. But beyond learning the tooling itself, you need to speak the language fluently — in standups, incident calls, code reviews, and architecture discussions. This guide covers the essential Prometheus and PromQL vocabulary you need to sound confident and precise in English-speaking teams.
Core Concepts: Prometheus and Data Collection
Prometheus — an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Prometheus collects time-series data by pulling metrics from configured targets at regular intervals.
“We migrated from Datadog to Prometheus last quarter — the cost savings alone justified it.”
“Prometheus is scraping our services every fifteen seconds; that should be granular enough for the SLO dashboards.”
scrape interval — the frequency at which Prometheus pulls (or “scrapes”) metrics from a target. The built-in default is one minute, though many deployments set it to 15 or 30 seconds; it can be set globally and overridden per job.
“Your scrape interval is set to one minute — that’s too coarse for a latency alert with a five-second threshold.”
“We lowered the scrape interval on the payment service to ten seconds after the last incident.”
target — any endpoint that Prometheus monitors. A target exposes metrics over HTTP, usually at the /metrics path. Targets are discovered statically (via config) or dynamically (via service discovery).
“Half our targets are showing as DOWN in the Prometheus UI — looks like the firewall rule is blocking port 9090.”
“We use Kubernetes service discovery so new pods register as targets automatically.”
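To make the idea of a target concrete, here is a minimal sketch of an endpoint that serves metrics at /metrics in the Prometheus text exposition format, using only the Python standard library. The metric name demo_requests_total, the SAMPLES store, and port 8000 are illustrative assumptions; a real service would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory metric store: (metric name, label pairs) -> value.
SAMPLES = {
    ("demo_requests_total", (("method", "GET"),)): 42.0,
    ("demo_requests_total", (("method", "POST"),)): 7.0,
}


def render_metrics(samples):
    """Render samples in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_requests_total Requests handled (illustrative).",
        "# TYPE demo_requests_total counter",
    ]
    for (name, labels), value in sorted(samples.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the conventional /metrics path is served.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; call this to expose /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a scrape job at that port would make this process show up as a target.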
exporter — a small program that converts metrics from a third-party system (a database, a message queue, hardware) into the Prometheus format and exposes them as a scrape target. Common examples include node_exporter for host-level metrics and postgres_exporter for PostgreSQL.
“We added the Redis exporter to the sidecar so Prometheus can track memory usage per instance.”
“The node exporter gives us CPU, memory, and disk I/O without any code changes in the application itself.”
Metric Types
Understanding metric types is critical — the type you choose determines which PromQL functions you can apply.
counter — a metric that only ever increases (or resets to zero on restart). Counters are used for values like total HTTP requests, errors, or bytes sent. You almost always apply rate() or increase() to a counter in queries.
“Don’t use a gauge for request counts — that’s a counter. It should only go up.”
“The counter reset after the pod restarted, so the graph has that drop. Use rate() and it will handle resets automatically.”
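The reset handling mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not Prometheus's implementation: the real rate() also extrapolates to the edges of the time window, which is omitted here. The function names and sample data are assumptions made for the example.

```python
def increase(samples):
    """Total increase across (timestamp, value) samples.

    Any drop in value is treated as a counter reset.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as new increase.
        total += curr if curr < prev else curr - prev
    return total


def simple_rate(samples):
    """Per-second rate between the first and last sample (no extrapolation)."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# The counter drops from 130 to 10 when the pod restarts.
samples = [(0, 100.0), (15, 130.0), (30, 10.0), (45, 40.0)]
print(increase(samples))     # 30 + 10 + 30
print(simple_rate(samples))  # increase divided by 45 seconds
```

A plain subtraction of last minus first would give a negative number here, which is why raw counter values are rarely graphed directly.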
gauge — a metric that can go up or down. Gauges represent a current state: CPU usage, active connections, queue depth, memory consumption.
“Temperature, free disk space, in-flight requests — these are all gauges. They reflect a snapshot of the current value.”
“Our goroutine gauge has been climbing for three hours. Something is leaking.”
histogram — a metric that samples observations and organises them into configurable buckets. Histograms are ideal for measuring request latency and response size distributions. They expose _bucket, _sum, and _count time series, enabling percentile calculations with histogram_quantile().
“Use a histogram for latency so we can compute the p99 without storing every individual data point.”
“The histogram buckets were too coarse — we were missing the tail latency above 500ms entirely.”
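The following sketch shows roughly how a percentile can be estimated from cumulative bucket counts by linear interpolation, which is the idea behind histogram_quantile(). It is a simplified approximation written for this guide; the bucket bounds and counts are made up, and the lowest bucket is assumed to start at zero.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with a final +Inf bucket holding the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets in seconds.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)     # interpolated between 0.5 and 1.0
p995 = histogram_quantile(0.995, buckets)   # capped at 1.0, the last finite bound
```

The second call illustrates the coarse-bucket problem: once the quantile falls into the +Inf bucket, the estimate cannot say anything about how slow the tail really is.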
summary — similar to a histogram but calculates quantiles on the client side. Summaries are less flexible than histograms when aggregating across multiple instances and are generally discouraged in favour of histograms in modern setups.
“We switched from summary to histogram because summaries can’t be aggregated across replicas.”
“The summary gives you quantiles cheaply, but you lose the ability to re-aggregate in PromQL.”
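A small numeric sketch of why client-side quantiles cannot simply be combined across replicas. The two replicas and their latencies are invented for illustration, and the percentile helper is a basic index-based estimate, not what any Prometheus client library does internally.

```python
def percentile(values, q):
    """Simple index-based percentile over raw observations."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


# Replica A handles most traffic and is fast; replica B is slow but rarely used.
replica_a = [0.1] * 1000
replica_b = [2.0] * 10

# What you get by averaging the two per-replica p99 values.
averaged_p99 = (percentile(replica_a, 0.99) + percentile(replica_b, 0.99)) / 2

# What the p99 actually is across all requests.
true_p99 = percentile(replica_a + replica_b, 0.99)

print(averaged_p99, true_p99)
```

The averaged figure lands around one second while the fleet-wide p99 is 0.1 seconds. Histogram buckets avoid this because bucket counts from different replicas can be summed before the quantile is estimated.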
PromQL: Querying Metrics
PromQL (Prometheus Query Language) — the functional query language for selecting and aggregating time-series data in Prometheus. It is used in dashboards, alert rules, and recording rules.
“Can you write the PromQL for the error rate? I need it for the Grafana panel.”
“PromQL looks intimidating at first, but once you understand selectors and aggregations it becomes very readable.”
instant vector — a PromQL expression that returns a single value per time series at a specific point in time. Most simple queries return an instant vector.
“An instant vector gives you the current value for each label set — perfect for a gauge display in Grafana.”
range vector — a PromQL expression that returns a range of values over a specified time window (e.g., [5m]). Range vectors are required by functions like rate() and increase().
“You need a range vector for rate() — something like http_requests_total[5m].”
“Extend the range vector to [15m] if your scrape interval is one minute; otherwise the rate calculation will be noisy.”
label — a key-value pair attached to a metric that adds dimensional context. Labels allow you to filter, group, and aggregate metrics. Examples include job, instance, status_code, and env.
“Always add an env label so you can separate production metrics from staging in the same Prometheus instance.”
“High label cardinality — like using a user ID as a label — will cause performance problems. Keep labels low-cardinality.”
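The arithmetic behind the cardinality warning is worth seeing once. In the worst case, one metric can produce a time series for every combination of label values, so the upper bound is the product of the distinct values per label. The label names and counts below are assumptions chosen for the example.

```python
import math


def max_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(len(values) for values in label_values.values())


labels = {
    "env": ["prod", "staging"],
    "status_code": ["200", "404", "500"],
    "instance": [f"pod-{i}" for i in range(10)],
}
before = max_series(labels)          # 2 * 3 * 10

# Adding a per-user label multiplies the bound by the number of users.
labels["user_id"] = range(100_000)
after = max_series(labels)

print(before, after)
```

Going from 60 potential series to 6,000,000 for a single metric is what people mean by a cardinality explosion.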
selector — a PromQL syntax construct that filters time series by their labels, using {} notation. You can match exact values (=), exclude values (!=), or use regular expressions (=~, !~).
“Add {status_code=~"5.."} to your selector to isolate server errors.”
“That selector is too broad — it’s matching metrics from every service. Add job="payments" to narrow it down.”
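The four matcher types can be sketched as a small Python function. Two details are modelled on how PromQL behaves: regular expressions must match the whole label value, and a missing label is treated as an empty string. Python's re module stands in for the regex engine Prometheus uses, so edge-case syntax may differ; the function name and tuple layout are assumptions for this example.

```python
import re


def matches(labels, matchers):
    """Return True if a label set satisfies every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like ""
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            # Regex matchers are anchored: the whole value must match.
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True


selector = [("job", "=", "payments"), ("status_code", "=~", "5..")]
print(matches({"job": "payments", "status_code": "503"}, selector))
print(matches({"job": "payments", "status_code": "200"}, selector))
```

Because of the anchoring, the pattern 5.. matches 503 but not 5030.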
aggregation operator — a PromQL operator that combines multiple time series into fewer series. The most common are sum(), avg(), min(), max(), and count(), usually grouped with a by or without clause. The rate() function is not an aggregation operator, but it is fundamental and is typically applied first: it computes the per-second rate of increase of a counter.
“Wrap it in sum by (status_code) — you want one line per status code, not one per instance.”
“Use rate(http_requests_total[5m]) to get requests per second, then sum by (job) to aggregate across replicas.”
“The avg() across regions was hiding the fact that one region had a 40% error rate.”
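As a rough model of what sum by (...) does, the sketch below groups series by the chosen labels, drops every other label, and adds up the values in each group. The data and the function name sum_by are illustrative assumptions.

```python
from collections import defaultdict


def sum_by(series, by):
    """Group (labels, value) pairs by the given label names and sum each group."""
    totals = defaultdict(float)
    for labels, value in series:
        # Only the grouping labels survive; everything else is dropped.
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)


# Hypothetical per-instance request rates.
series = [
    ({"instance": "a", "status_code": "200"}, 10.0),
    ({"instance": "b", "status_code": "200"}, 12.0),
    ({"instance": "a", "status_code": "500"}, 1.0),
]
per_status = sum_by(series, ["status_code"])
print(per_status)
```

Three input series become two output series, one per status code, which is the "one line per status code" effect from the first example sentence.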
Alerting and Rules
recording rule — a pre-computed PromQL expression that Prometheus evaluates on a schedule and stores as a new time series. Recording rules improve query performance for expensive expressions used repeatedly in dashboards or alerts.
“That dashboard is slow because it’s computing a heavy aggregation on every load. Wrap it in a recording rule.”
“We have a recording rule for the SLO error budget — it runs every minute and the Grafana panel just reads that series.”
alert rule — a PromQL expression that fires an alert when it evaluates to a non-empty result for a specified duration (the for clause). Alert rules are defined in rule files that Prometheus loads, and firing alerts are sent to Alertmanager for routing.
“The alert rule is correct, but add for: 5m so we don’t page on a single bad scrape.”
“Write the alert rule so it triggers only when the error rate exceeds one percent for ten minutes.”
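The effect of the for clause can be modelled as a small state machine: an alert is inactive while the expression returns nothing, pending once it starts returning results, and firing only after it has stayed true for the full duration. The sketch below is a simplified model with an assumed fixed evaluation interval, not Prometheus's rule engine.

```python
def alert_states(condition_by_eval, eval_interval, for_duration):
    """Return the alert state after each rule evaluation.

    condition_by_eval: booleans, True when the expression returned a result.
    eval_interval and for_duration are in seconds.
    """
    states = []
    active_since = None
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval
        if not condition:
            # Any evaluation without a result resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_duration else "pending")
    return states


# One bad evaluation, a recovery, then a sustained problem.
conditions = [True, False] + [True] * 7
states = alert_states(conditions, eval_interval=60, for_duration=300)
print(states)
```

The isolated blip at the start only reaches pending, while the sustained problem fires once it has held for five minutes.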
Alertmanager — a component that handles alerts fired by Prometheus. It manages deduplication, grouping, routing, silencing, and inhibition before sending notifications to receivers such as PagerDuty, Slack, or email.
“The alert is firing in Prometheus, but nothing arrived in Slack — check the Alertmanager routing config.”
“We centralised all alerting through Alertmanager so the on-call team gets one notification per incident, not fifty.”
silencing — the act of suppressing alerts in Alertmanager for a defined period, typically during planned maintenance or a known incident. A silence matches alerts by label selectors and prevents notifications from being sent.
“Create a silence for the node exporter alerts before you reboot that host, otherwise the on-call will get paged.”
“The silence expired at midnight and the alerts flooded in. We need to extend it or fix the underlying issue.”
inhibition — an Alertmanager feature that suppresses certain alerts when a higher-priority alert is already firing. For example, you can inhibit all service-level alerts when a datacenter-down alert is active, reducing noise during major incidents.
“Set up inhibition so that pod-level alerts are suppressed whenever the whole cluster is marked as degraded.”
“Without inhibition rules, a single network failure generates hundreds of downstream alerts. It makes triage impossible.”
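Inhibition can be pictured as a matching problem: a rule names a source alert, a target alert, and a list of labels that must be equal on both. The sketch below is a simplified model that supports only exact label matches; the rule layout, alert names, and the cluster label are assumptions for illustration, and real Alertmanager rules support richer matchers.

```python
def has_labels(alert, wanted):
    """True if the alert carries every wanted label with the wanted value."""
    return all(alert.get(name) == value for name, value in wanted.items())


def is_inhibited(alert, firing_alerts, rule):
    """True if some firing source alert suppresses this target alert."""
    if not has_labels(alert, rule["target"]):
        return False
    for source in firing_alerts:
        if source is alert:
            continue
        same_scope = all(source.get(l) == alert.get(l) for l in rule["equal"])
        if has_labels(source, rule["source"]) and same_scope:
            return True
    return False


# Hypothetical rule: a degraded cluster mutes warnings from the same cluster.
rule = {
    "source": {"alertname": "ClusterDegraded"},
    "target": {"severity": "warning"},
    "equal": ["cluster"],
}
cluster_alert = {"alertname": "ClusterDegraded", "severity": "critical", "cluster": "eu-1"}
pod_alert = {"alertname": "PodCrashLooping", "severity": "warning", "cluster": "eu-1"}
other_pod_alert = {"alertname": "PodCrashLooping", "severity": "warning", "cluster": "us-1"}
firing = [cluster_alert, pod_alert, other_pod_alert]

print(is_inhibited(pod_alert, firing, rule))
print(is_inhibited(other_pod_alert, firing, rule))
```

The equal list is what keeps the rule from over-reaching: the pod alert in eu-1 is suppressed, while the one in us-1 still notifies because its cluster is not the degraded one.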
How to Use These in Conversation
Scenario 1 — Incident triage call:
“Prometheus is showing a spike in http_requests_total — the rate over the last five minutes is three times normal. The histogram shows p99 latency is over two seconds. I’ve silenced the downstream service alerts while we investigate.”
Scenario 2 — Code review comment:
“This metric should be a counter, not a gauge — the value only ever increases. Also, the label user_id is going to cause cardinality explosion. Replace it with a tier label.”
Scenario 3 — Architecture discussion:
“We should add a recording rule for this SLO expression — it’s evaluated in four different Grafana panels and it’s an expensive range query over 24 hours. The recording rule will compute it once per minute and keep dashboards snappy.”
Scenario 4 — Handover to on-call:
“There’s an active silence on the staging exporters until 08:00. The Alertmanager inhibition rule will suppress pod restarts if the cluster-down alert fires. The PromQL for the error budget is in the runbook — just change the env selector if you need to check staging separately.”
Quick Reference
| Term | Type | Plain English Summary |
|---|---|---|
| Prometheus | Tool | Open-source monitoring system that scrapes and stores time-series metrics |
| exporter | Component | Adaptor that exposes third-party system metrics in Prometheus format |
| scrape interval | Config | How often Prometheus pulls metrics from a target |
| counter | Metric type | Monotonically increasing value; use rate() to query it |
| gauge | Metric type | Current snapshot value that can go up or down |
| histogram | Metric type | Bucketed observations; enables histogram_quantile() for percentiles |
| label | Data model | Key-value tag that adds dimensions to a metric |
| PromQL | Language | Query language for selecting and aggregating Prometheus metrics |
| recording rule | Config | Pre-computed query stored as a new time series for performance |
| Alertmanager | Tool | Routes, deduplicates, silences, and inhibits alerts before notification |