Vocabulary for Observability Engineers: Logs, Metrics, and Traces

Essential English vocabulary for observability engineers: the three pillars, cardinality, percentiles, alerting terms, and how to use each correctly in context.

Observability is how engineers understand what a system is doing from the outside — by collecting logs, metrics, and traces. The field has a dense, specific vocabulary, and many terms (cardinality, percentile, span) trip up non-native speakers. This guide explains the essential words in context so you can read dashboards, write alerts, and discuss incidents fluently.


The three pillars

Observability rests on three data types. Know exactly what each is and how it’s used.

Pillar | What it is | A sentence
Logs | Timestamped text records of events | “Grep the logs for the error.”
Metrics | Numeric measurements over time | “CPU is a gauge metric.”
Traces | The path of a request across services | “The trace shows where the latency is.”

“When a request is slow, metrics tell you that it’s slow, traces tell you where, and logs tell you why.”

That sentence is the single best way to remember the difference — and a great thing to say in an interview.


Metric vocabulary

Metrics come in specific types, and you must use the right word for each.

Term | Meaning
Counter | A number that only goes up (e.g. total requests)
Gauge | A number that goes up and down (e.g. memory in use)
Histogram | A distribution of values (e.g. latency buckets)
Rate | How fast a counter increases
Cardinality | The number of unique label combinations

“Request count is a counter, so we look at its rate, not its raw value. Memory is a gauge — we read it directly.”
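To make the counter/gauge distinction concrete, here is a minimal Python sketch. The sample data and the helper name `rate_per_second` are made up for illustration, and the calculation ignores counter resets, which real monitoring systems have to handle.

```python
from typing import List, Tuple

# One sample = (timestamp in seconds, value). Hypothetical data shape.
Sample = Tuple[float, float]


def rate_per_second(samples: List[Sample]) -> float:
    """Average per-second increase of a counter between first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    return (v1 - v0) / (t1 - t0)


# A counter: the raw value (1600) says little, the rate says a lot.
requests_total = [(0.0, 1000.0), (15.0, 1300.0), (30.0, 1600.0)]
request_rate = rate_per_second(requests_total)
print(request_rate)  # 20.0 requests per second

# A gauge: read the latest value directly.
memory_bytes = [(0.0, 512e6), (15.0, 498e6), (30.0, 530e6)]
current_memory = memory_bytes[-1][1]
print(current_memory)  # 530000000.0
```

In a sentence: “The counter is at 1,600, but what matters is the rate: 20 requests per second.”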

Cardinality is the term that confuses people most. High cardinality means a label takes too many distinct values (for example, tagging metrics by user ID). Each unique combination of label values becomes its own time series, so storage cost explodes.

“Don’t put user_id in a metric label — it’s a cardinality explosion. Use it in traces instead.”
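A rough way to see why this happens: the worst-case number of time series is the product of the distinct values of each label. The sketch below uses invented label names and counts purely as an illustration.

```python
from math import prod
from typing import Dict


def series_count(distinct_values_per_label: Dict[str, int]) -> int:
    """Upper bound on time series: every label combination is its own series."""
    return prod(distinct_values_per_label.values())


# Hypothetical labels with a small, bounded set of values.
safe_labels = {"method": 5, "status": 6, "endpoint": 40}

# Same metric, plus one unbounded label.
risky_labels = {**safe_labels, "user_id": 100_000}

print(series_count(safe_labels))   # 1200
print(series_count(risky_labels))  # 120000000
```

One extra label turned 1,200 series into 120 million. That is what engineers mean by a “cardinality explosion.”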


Percentiles: the most misused word

Latency is described with percentiles, not averages. You must say them correctly.

Notation | Said as | Meaning
p50 | “p fifty” / “the median” | Half of requests are faster
p95 | “p ninety-five” | 95% are faster, 5% slower
p99 | “p ninety-nine” | 99% are faster
p99.9 | “p ninety-nine point nine” / “p three nines” | 99.9% are faster; only the slowest 0.1% are slower

“The average latency looks fine, but the p99 is terrible: the users behind our slowest 1% of requests are suffering. Averages hide tail latency.”

Tail latency (the slow end of the distribution) is critical vocabulary. The “tail” is the long thin part of the curve.

“We’re chasing the long tail — a small number of very slow requests dragging the p99 up.”
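Here is a small Python sketch of how an average can hide the tail. The latency numbers are invented, and `percentile` uses the simple nearest-rank method; monitoring tools often estimate percentiles from histogram buckets instead, so their numbers can differ slightly.

```python
import math
from statistics import mean
from typing import List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies: 98 fast requests and 2 very slow ones.
latencies_ms = [100] * 98 + [5000] * 2

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(avg)  # 198
print(p50)  # 100
print(p95)  # 100
print(p99)  # 5000
```

The average of 198 ms looks healthy, and even the p95 is fine. Only the p99 reveals the two requests that took five seconds.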


Logging vocabulary

Term | Meaning
Log level | Severity: DEBUG, INFO, WARN, ERROR
Structured logging | Logs as key-value/JSON, not free text
Log line | A single log entry
Verbose | Producing a lot of log output
Correlation ID | An ID linking logs from one request
Sampling | Keeping only a fraction of logs/traces

“Use structured logging with a correlation ID so you can stitch together every log line for a single request across services.”

The verb stitch together (combine related pieces) is natural and useful here.
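As an illustration, this is roughly what structured logging with a correlation ID can look like using Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, the logger name, and the messages are all hypothetical, and the in-memory buffer stands in for a real log pipeline.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per log line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Set via the `extra` argument on each logging call.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


buffer = io.StringIO()  # stand-in for a log aggregation backend
logger = build_logger("checkout", buffer)

request_a = str(uuid.uuid4())
request_b = str(uuid.uuid4())
logger.info("payment started", extra={"correlation_id": request_a})
logger.info("cart viewed", extra={"correlation_id": request_b})
logger.error("downstream timeout", extra={"correlation_id": request_a})

# Stitch together every log line that belongs to request A.
entries = [json.loads(line) for line in buffer.getvalue().splitlines()]
request_a_lines = [e["message"] for e in entries if e["correlation_id"] == request_a]
print(request_a_lines)  # ['payment started', 'downstream timeout']
```

Because every log line is JSON, filtering by `correlation_id` is a simple field match rather than a fragile text search.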


Tracing vocabulary

Term | Meaning
Span | One unit of work in a trace
Parent / child span | Nesting of operations
Trace ID | The ID for the whole request journey
Instrumentation | Adding tracing code to your service
Distributed trace | A trace spanning multiple services

“Each service adds a span to the distributed trace. The waterfall view shows that the database span is eating 80% of the time.”

Instrument is a verb derived from a noun here: “we need to instrument the payment service” means adding observability code to it.
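To picture parent and child spans, here is a toy model in Python. It is not a real tracing library: the `Span` class, the span names, and the timings are invented, chosen so the numbers match the example sentence above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work in a trace (simplified, hypothetical model)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


# All spans share one trace; child spans point at their parent.
trace = [
    Span("a", None, "GET /checkout", 0, 500),
    Span("b", "a", "auth-service", 10, 60),
    Span("c", "a", "db.query orders", 70, 470),
]

root = next(s for s in trace if s.parent_id is None)
children = [s for s in trace if s.parent_id == root.span_id]
slowest = max(children, key=lambda s: s.duration_ms)
share = slowest.duration_ms / root.duration_ms

print(slowest.name)  # db.query orders
print(share)         # 0.8
```

A waterfall view is a picture of exactly this data: each span drawn as a bar from its start time to its end time, nested under its parent.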


Alerting vocabulary

Term | Meaning
Threshold | The value that triggers an alert
Fire | When an alert triggers
Flapping | Alert toggling on and off rapidly
Noisy | Alerts that fire too often without value
Alert fatigue | Becoming numb to too many alerts
Actionable | An alert a human can actually do something about

“This alert is flapping and noisy, and it’s causing alert fatigue. If it’s not actionable, let’s delete it. Every alert should require a human action.”

The verb is fire: “the alert fired at 2 a.m.” — not “the alert was activated.”
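The words threshold, fire, and flapping fit together as in the sketch below. The function names, the 5% threshold, and the cut-off of three transitions are arbitrary choices for illustration, not settings from any particular alerting tool.

```python
from typing import List


def alert_states(values: List[float], threshold: float) -> List[bool]:
    """True where the alert would be firing (value above the threshold)."""
    return [v > threshold for v in values]


def count_transitions(states: List[bool]) -> int:
    """How many times the alert toggled between firing and not firing."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def is_flapping(values: List[float], threshold: float, max_transitions: int = 3) -> bool:
    return count_transitions(alert_states(values, threshold)) > max_transitions


THRESHOLD = 0.05  # hypothetical: alert when error rate exceeds 5%

# Error rate hovering right around the threshold.
hovering = [0.04, 0.06, 0.04, 0.06, 0.04, 0.06]
# One clear incident that starts and then recovers.
incident = [0.01, 0.01, 0.09, 0.09, 0.09, 0.01]

print(is_flapping(hovering, THRESHOLD))  # True
print(is_flapping(incident, THRESHOLD))  # False
```

In a sentence: “The alert fired three times in six minutes because the error rate is hovering at the threshold. It’s flapping.”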


Phrases for incident discussions

  • “Let’s pull up the dashboard and look at the p99.”
  • “I’ll dig into the traces to find the slow span.”
  • “The metrics are flat — no change — but the logs show errors.”
  • “We’re flying blind here; this service isn’t instrumented.”
  • “Let’s drill down from the service level to the endpoint.”

“Error rate spiked, p99 shot up, and the traces point to a single slow downstream call. The logs confirm a timeout. Classic.”

Flying blind (operating without visibility) is excellent vocabulary for an uninstrumented system.


Common mistakes

  1. Saying “average” when you mean “p99.” Averages hide the worst cases. SREs almost always care about percentiles.
  2. Confusing “counter” and “gauge.” A counter only increases; a gauge moves both ways. Using the wrong one breaks your math.
  3. Mispronouncing “cardinality.” It’s /ˌkɑːrdɪˈnæləti/ — “car-di-NAL-i-ty.”
  4. Saying “logs says.” Logs is plural: “the logs say,” “a log line shows.”
  5. Using “metric” for everything. A log is not a metric. Keep the three pillars distinct.

Quick reference glossary

  • SLO / SLI — reliability targets and the metrics behind them
  • Golden signals — latency, traffic, errors, saturation
  • Saturation — how full a resource is
  • Aggregation — combining many data points (sum, avg, max)
  • Retention — how long data is kept
  • Dashboard — a visual panel of metrics
  • Heatmap — a visualisation of distribution over time

“Monitor the four golden signals — latency, traffic, errors, saturation — and you’ll catch most problems before users do.”


Key takeaways

  • Metrics = that it’s slow, traces = where, logs = why.
  • Talk in percentiles (p95, p99), not averages — averages hide tail latency.
  • Mind cardinality: high-cardinality labels explode cost; put unique IDs in traces.
  • Alerts should be actionable; noisy, flapping alerts cause alert fatigue.

Master this vocabulary and dashboards stop being intimidating walls of numbers — they become a language you read fluently, even at 3 a.m.