English for Prometheus and Metrics

Master the vocabulary for discussing counters, gauges, histograms, and alerting rules when working with Prometheus and metrics-based monitoring.

Metrics vocabulary gets mixed up constantly — “counter” and “gauge” sound interchangeable to someone new to Prometheus, but they behave completely differently and mixing them up produces genuinely broken dashboards. Precise language here matters especially during an incident, when a team needs to interpret a graph correctly under time pressure rather than guess at what a metric actually represents.

Key Vocabulary

Counter A metric type that only ever increases (or resets to zero on restart), used for things like total requests served or total errors — you typically graph its rate of change, not its raw value. Example: “This is a counter, so the raw value climbing steadily is expected — what we actually care about is the rate of increase over the last five minutes, not the absolute number.”

Gauge A metric type that can go up or down freely, used for things like current memory usage or the number of active connections at a given moment. Example: “Unlike the request counter, this is a gauge — it’s meant to fluctuate, and a sudden drop to zero is the concerning signal, not a steady climb.”

Histogram A metric type that samples observations (like request durations) into configurable buckets, allowing you to compute quantiles like p95 or p99 latency after the fact. Example: “We’re using a histogram for request latency so we can calculate p99 later, rather than only ever knowing the average, which hides tail latency problems.”

Scrape The process by which Prometheus pulls metrics from a target at a regular interval, rather than the target pushing metrics to Prometheus. Example: “This service’s metrics aren’t showing up because Prometheus isn’t configured to scrape its metrics endpoint — the target isn’t in our scrape config yet.”

Label (metric label) A key-value pair attached to a metric that allows it to be sliced and filtered along a dimension, like endpoint or status_code, without needing a separate metric per value. Example: “Instead of a separate metric per endpoint, we use a single counter with an endpoint label, so we can filter or aggregate by that dimension in queries.”

Cardinality The number of distinct label value combinations a metric can produce, which directly affects memory and storage cost — high-cardinality labels (like raw user IDs) are a common source of performance problems. Example: “We accidentally added a raw user ID as a label on this metric, which exploded cardinality and is why memory usage on the Prometheus server spiked.”

Alerting rule A condition, expressed as a query, that triggers a notification when it evaluates to true for a defined duration, distinguishing a transient blip from a genuinely sustained problem. Example: “We added a for: 5m clause to this alerting rule so a brief latency spike doesn’t page anyone unless it’s sustained for five minutes.”

PromQL The query language used to select, filter, and aggregate metrics stored in Prometheus, used both for dashboards and for defining alerting rules. Example: “This PromQL query is summing the rate of errors across all instances, but we actually want it broken down per instance to find which one is failing.”

Common Phrases

In code reviews:

  • “This metric is being instrumented as a gauge, but it only ever increases — should it actually be a counter instead?”
  • “This label includes a raw request ID, which will create unbounded cardinality — can we drop it or bucket it into something coarser?”
  • “The alerting rule doesn’t have a for duration, so a single scrape spike will page someone even if it resolves on its own within seconds.”

In standups:

  • “Yesterday I converted this misclassified gauge to a proper counter with rate-based alerting; today I’m validating the new dashboard against yesterday’s incident data.”
  • “I’m blocked on a cardinality explosion — a label we added last week is generating far more time series than expected, and the Prometheus server is under memory pressure.”
  • “I finished adding a histogram for this endpoint’s latency, so we can finally track p95 instead of only having an average.”

In incident retrospectives:

  • “The dashboard showed a counter’s raw value climbing, which looked alarming, but that’s just expected counter behavior — the actual signal we needed was the rate, which was flat.”
  • “We didn’t get paged for this because the alerting rule’s threshold was set against an average, and the p99 latency was already well past what users were experiencing as slow.”
  • “High cardinality on this label caused the query to time out during the incident, right when we needed to look at the dashboard the most.”

Phrases to Avoid

Saying “the metric is wrong” for expected counter behavior. Say instead: “this is a counter, so the raw value only ever increases — we should be looking at its rate, not the absolute number” — this is one of the most common points of confusion for anyone new to reading Prometheus dashboards.

Saying “add more labels” without considering cardinality. Say instead: “check whether this label could produce unbounded values before adding it — high cardinality labels are a common cause of Prometheus performance problems.” Naming cardinality explicitly signals you’re aware of the tradeoff, not just adding dimensions freely.

Saying “alert on average latency” as a default. Say instead: “alert on a specific quantile, like p95 or p99, since averages hide the tail latency that actually affects real users” — this is a frequent, consequential mistake in alerting rule design.

Quick Reference

TermHow to use it
counter”This counter only increases; graph its rate, not the raw value.”
gauge”The gauge should fluctuate normally; a drop to zero is the concern.”
histogram”The histogram lets us compute p99 latency, not just an average.”
scrape”The target isn’t in our scrape config, so its metrics aren’t collected.”
cardinality”A raw user ID label exploded cardinality on this metric.”
alerting rule”The for: 5m clause prevents a brief blip from paging anyone.”

Key Takeaways

  • Distinguish counters from gauges precisely — a climbing counter is expected behavior, while a fluctuating gauge dropping unexpectedly is the actual signal to watch.
  • Prefer histograms over simple averages for latency metrics, since averages hide tail latency that matters most to real users.
  • Watch cardinality carefully before adding a new label — unbounded label values are a leading cause of Prometheus performance and cost problems.
  • Use a for duration in alerting rules to distinguish a transient blip from a sustained, genuinely actionable problem.
  • When a dashboard looks alarming, check whether the metric type explains the shape (a counter’s steady climb) before treating it as an incident signal.