Observability Glossary

34 of the words you'll hear in on-call, SRE reviews and incident retros — what each one means, an example or formula, and the gotcha you'll wish someone had told you.

Sections

The three pillars

metric

A numeric measurement sampled over time — request rate, CPU, queue depth. Cheap to store and aggregate; great for dashboards and alerts.

http_requests_total{method="GET",status="200"} 48213

log

A timestamped record of a discrete event. High detail, high volume — the thing you grep when an alert fires.

2026-05-29T10:02:11Z ERROR order_id=4821 payment_declined gateway=stripe

trace

The end-to-end story of one request as it crosses services, made of nested spans. Answers "where did the time go?".

trace_id=ab12 → [api 120ms] → [db 80ms] → [cache 5ms]

span

A single timed unit of work within a trace — one operation, with a start, duration, parent, and attributes.

span: "SELECT orders" parent=api duration=80ms db.system=postgres

structured logging

Logging machine-parseable key/value records (usually JSON) instead of free-form text, so logs can be queried and filtered.

{"level":"error","msg":"payment failed","order_id":4821,"latency_ms":250}

💡 Include a trace_id field so logs link straight to the trace.
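
A minimal sketch of the tip above, using only Python's standard library. The `log_event` helper and its field names are illustrative assumptions, not the API of any particular logging library.

```python
import json
import sys
from datetime import datetime, timezone

def log_event(level, msg, trace_id, stream=sys.stdout, **fields):
    """Write one JSON object per line; trace_id ties the log line to its trace."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "msg": msg,
        "trace_id": trace_id,  # the join key between logs and traces
        **fields,              # any extra key/value context
    }
    line = json.dumps(record)
    stream.write(line + "\n")
    return line

line = log_event("error", "payment failed", trace_id="ab12",
                 order_id=4821, latency_ms=250)
```

Because every line is valid JSON, a log backend can filter on `trace_id` or `order_id` directly instead of pattern-matching free text.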

Reliability & SLOs

SLI

Service Level Indicator — the actual measured number you care about, e.g. the proportion of requests served under 300ms.

SLI = good_requests / valid_requests = 0.9987
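
The ratio above can be sketched in a few lines of Python. The request counts are made up for illustration, and treating "no traffic" as 1.0 is an assumption you may want to change.

```python
def sli(good_requests: int, valid_requests: int) -> float:
    """Fraction of valid requests that met the 'good' criterion."""
    if valid_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_requests / valid_requests

value = sli(good_requests=99_870, valid_requests=100_000)
meets_slo = value >= 0.999  # compare the measured SLI against a 99.9% target
print(f"SLI = {value:.4f}")  # SLI = 0.9987
```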

SLO

Service Level Objective — the internal target for an SLI over a window, e.g. 99.9% of requests under 300ms over 30 days.

SLO: 99.9% availability over 30 days

SLA

Service Level Agreement — a contractual promise to customers, with penalties. Usually looser than your internal SLO.

SLA: 99.5% uptime or we credit your account

error budget

The allowed amount of failure: 100% minus the SLO. A 99.9% SLO grants ~43 minutes of downtime per 30 days to spend on risk.

budget = (1 - 0.999) * 30d = 43.2 min/month

💡 When the budget is spent, you freeze risky deploys and focus on reliability.
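
A small sketch of the budget arithmetic, assuming a time-based (downtime) budget over a 30-day window. The 30 minutes already "spent" is a hypothetical figure.

```python
from datetime import timedelta

def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed failure time: (1 - SLO) of the window."""
    return window * (1 - slo)

budget = error_budget(0.999, timedelta(days=30))
spent = timedelta(minutes=30)       # hypothetical downtime so far this window
remaining = budget - spent
print(budget.total_seconds() / 60)  # roughly 43.2 minutes
```

If `remaining` goes negative, the budget is exhausted and the freeze described above applies.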

burn rate

How fast you’re consuming the error budget relative to "on pace". A burn rate of 1 spends it exactly over the window; higher means trouble.

burn_rate = (error_rate) / (1 - SLO)
14.4x over 1h → page now

💡 Multi-window, multi-burn-rate alerts catch both fast outages and slow leaks.
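
A sketch of the formula and of one multi-window check. The 14.4 threshold comes from the example above. Pairing a long window with a short one (say 1h and 5m) is a common pattern, but the window choice and the `should_page` helper are assumptions here.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'on pace' the budget is being spent."""
    return error_rate / (1 - slo)

def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                slo: float,
                threshold: float = 14.4) -> bool:
    """Page only if BOTH windows burn above the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_error_rate, slo) >= threshold
            and burn_rate(short_window_error_rate, slo) >= threshold)

ongoing = should_page(0.02, 0.03, slo=0.999)      # both windows hot
recovered = should_page(0.02, 0.0005, slo=0.999)  # short window back to normal
```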

golden signals

Google SRE’s four signals to watch on any user-facing service: latency, traffic, errors, and saturation.

latency (p99), traffic (req/s), errors (5xx %), saturation (% capacity)

Incident metrics

MTTR

Mean Time To Recover/Repair: average time from an incident starting to service being restored. DORA tracks the same idea as "time to restore service", one of its four key metrics.

MTTR = total_downtime / number_of_incidents

MTTD

Mean Time To Detect — average time between a problem starting and someone (or an alert) noticing it.

MTTD = sum(detect_time - start_time) / incidents

MTTA

Mean Time To Acknowledge — average time from an alert firing to an on-call engineer acknowledging it.

MTTA = sum(ack_time - alert_time) / alerts

MTBF

Mean Time Between Failures — average healthy interval between incidents. Higher is better.

MTBF = total_uptime / number_of_failures
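
The four formulas above can be computed from the same incident records. This sketch uses two made-up incidents, treats the detection time as the moment the alert fired, and assumes a 30-day reporting window for MTBF.

```python
from datetime import datetime, timedelta

# Hypothetical incident timeline: start -> detected (alert) -> acked -> resolved.
incidents = [
    {"start": datetime(2026, 5, 1, 10, 0), "detected": datetime(2026, 5, 1, 10, 4),
     "acked": datetime(2026, 5, 1, 10, 6), "resolved": datetime(2026, 5, 1, 10, 40)},
    {"start": datetime(2026, 5, 15, 22, 0), "detected": datetime(2026, 5, 15, 22, 10),
     "acked": datetime(2026, 5, 15, 22, 12), "resolved": datetime(2026, 5, 15, 22, 20)},
]

def mean(deltas):
    """Average a list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

mttd = mean([i["detected"] - i["start"] for i in incidents])    # 7 min
mtta = mean([i["acked"] - i["detected"] for i in incidents])    # 2 min
mttr = mean([i["resolved"] - i["start"] for i in incidents])    # 30 min

window = timedelta(days=30)
downtime = sum((i["resolved"] - i["start"] for i in incidents), timedelta())
mtbf = (window - downtime) / len(incidents)                     # uptime per failure
```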

uptime / availability (nines)

The fraction of time a service is up, quoted in "nines". Each extra nine cuts allowed downtime by 10x.

99.9%  = 43.2 min/month
99.99% = 4.3 min/month
99.999% = 26 s/month
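
The table above assumes a 30-day month. A short sketch that reproduces it for any window; the function name is illustrative.

```python
def allowed_downtime_seconds(availability_pct: float, window_days: int = 30) -> float:
    """Downtime permitted by an availability target over the window."""
    return (1 - availability_pct / 100) * window_days * 24 * 3600

for pct in (99.9, 99.99, 99.999):
    seconds = allowed_downtime_seconds(pct)
    print(f"{pct}% -> {seconds / 60:.2f} min ({seconds:.0f} s)")
```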

Core concepts

cardinality

The number of unique label combinations a metric has. High cardinality (e.g. user_id as a label) blows up storage and cost.

http_requests{user_id="..."} → millions of series = bad

💡 Keep IDs and free-form values out of metric labels; put them in traces or logs.
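
A back-of-the-envelope sketch of why one extra label hurts: the worst-case series count is the product of the distinct values of every label. The label names and counts below are invented for illustration.

```python
from math import prod

def series_count(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on time series: every label combination could occur."""
    return prod(distinct_values_per_label.values())

safe = series_count({"method": 5, "status": 8, "route": 40})
risky = series_count({"method": 5, "status": 8, "route": 40, "user_id": 1_000_000})
print(safe)   # 1600
print(risky)  # 1600000000
```

In practice not every combination appears, so the real number is lower, but the multiplication is what makes unbounded labels dangerous.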

label / tag

A key/value dimension attached to a metric (Prometheus calls it a "label"; Datadog/StatsD call it a "tag") so you can slice and filter.

http_requests_total{method="POST", route="/checkout"}

sampling

Keeping only a fraction of traces/logs to control cost. Head sampling decides up front; tail sampling decides after seeing the whole trace.

sample 100% of errors, 1% of healthy traces
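
A sketch of both strategies, assuming spans are plain dicts with an optional `error` flag. Hashing the trace ID is one common way to make the head decision deterministic so every service agrees on it. Neither function is a real tracing SDK API.

```python
import hashlib

def head_sample(trace_id: str, rate: float) -> bool:
    """Decide up front from the trace ID alone (same answer in every service)."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64

def tail_sample(spans: list[dict], trace_id: str, healthy_rate: float = 0.01) -> bool:
    """Decide after the whole trace is seen: keep every error, a slice of the rest."""
    if any(span.get("error") for span in spans):
        return True
    return head_sample(trace_id, healthy_rate)

keep_failed = tail_sample([{"name": "api"}, {"name": "db", "error": True}], "ab12")
```

Tail sampling needs the complete trace buffered somewhere before the decision, which is why it usually runs in a collector rather than in the application.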

percentile (p50/p95/p99)

The value below which that share of measurements falls. p99 latency = 99% of requests were faster than this. Averages hide the tail.

p50=80ms p95=240ms p99=900ms

💡 You can’t average percentiles across instances — aggregate from a histogram instead.
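
A small demonstration of the tip above with two made-up instances. It uses a nearest-rank percentile, which is one of several common definitions.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

fast_instance = [10] * 99 + [20]         # latencies in ms, almost all quick
slow_instance = [10] * 90 + [1000] * 10  # 10% of requests are very slow

avg_of_p99 = (percentile(fast_instance, 99) + percentile(slow_instance, 99)) / 2
true_p99 = percentile(fast_instance + slow_instance, 99)
print(avg_of_p99)  # 505.0
print(true_p99)    # 1000
```

The averaged figure describes no real request. The true p99 only comes from the combined data, or from mergeable buckets such as a histogram.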

histogram

A metric that buckets observations (e.g. request durations) so you can compute percentiles and heatmaps server-side.

http_duration_bucket{le="0.3"} 9821
http_duration_bucket{le="1.0"} 9990
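
A simplified sketch of how a percentile can be estimated from cumulative buckets by interpolating linearly inside the bucket that contains the target rank. This is roughly the idea behind PromQL's `histogram_quantile`, not its exact implementation. The `le="0.1"` and `+Inf` buckets are assumptions added to make the example complete.

```python
def quantile_from_buckets(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs sorted by bound, ending at +Inf."""
    total = buckets[-1][1]
    target = q * total                     # rank of the observation we want
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if bound == float("inf"):
                return prev_bound          # cannot interpolate into an open-ended bucket
            # Assume observations are spread evenly inside this bucket.
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

buckets = [(0.1, 5000), (0.3, 9821), (1.0, 9990), (float("inf"), 10000)]
p99 = quantile_from_buckets(0.99, buckets)  # falls in the 0.3..1.0 bucket
```

The estimate is only as precise as the bucket boundaries, so place boundaries near the thresholds you care about (for example your SLO latency).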

gauge

A metric that can go up and down — a snapshot of a value right now (temperature, queue length, memory in use).

queue_depth 47

counter

A metric that only ever increases (or resets to zero on restart). You take its rate() to get a per-second value.

rate(http_requests_total[5m])
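
A sketch of what taking a rate over a counter involves, including the reset case. It is deliberately simpler than PromQL's `rate()`, which also extrapolates to the edges of the window. The sample values are invented.

```python
def rate(samples):
    """samples: (timestamp_seconds, counter_value) pairs, oldest first."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at zero,
        # so everything counted since the restart is new increase.
        increase += cur if cur < prev else cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

samples = [(0, 100), (60, 160), (120, 20), (180, 80)]  # reset between t=60 and t=120
per_second = rate(samples)  # (60 + 20 + 60) / 180
```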

instrumentation

The code that emits metrics, logs and traces. Can be manual (you add it) or automatic (an agent/library injects it).

// OpenTelemetry Go API: Start returns the derived context and the span.
ctx, span := tracer.Start(ctx, "checkout")
defer span.End()

context propagation

Passing trace identifiers (trace_id, span_id) across service boundaries — usually in HTTP headers — so spans link into one trace.

traceparent: 00-ab12...-cd34...-01
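
A sketch of reading and forwarding a W3C Trace Context `traceparent` header (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The helper names are hypothetical, and real services would normally let their tracing SDK do this.

```python
import re

TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Return the trace context carried by the header, or None if it is malformed."""
    match = TRACEPARENT.match(header.strip())
    if not match:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return {"trace_id": trace_id, "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 0x01)}

def child_header(parent: dict, new_span_id: str) -> str:
    """Header for the outgoing call: same trace_id, this service's span as parent."""
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}"

incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_header(incoming, new_span_id="b7ad6b7169203331")
```

The trace ID stays constant across every hop; only the parent span ID changes, which is what lets the backend stitch spans into one tree.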

RED method

For request-driven services, watch Rate, Errors, and Duration. A simpler, service-focused cousin of the golden signals.

Rate=req/s  Errors=fail %  Duration=p99 latency
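
A sketch of deriving the three RED numbers from raw request records. It assumes an "error" means a 5xx status and uses a nearest-rank p99; the sample data is invented.

```python
import math

def red(requests, window_seconds):
    """requests: (status_code, duration_ms) pairs observed during the window."""
    durations = sorted(duration for _, duration in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    p99_index = max(1, math.ceil(99 * len(durations) / 100)) - 1
    return {
        "rate": len(requests) / window_seconds,         # req/s
        "errors_pct": 100 * errors / len(requests),     # % failed
        "duration_p99_ms": durations[p99_index],        # tail latency
    }

requests = [(200, 80)] * 97 + [(500, 900)] * 3
summary = red(requests, window_seconds=60)
```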

USE method

For resources (CPU, disk, network), watch Utilisation, Saturation, and Errors. Brendan Gregg’s checklist for finding bottlenecks.

CPU: Utilisation 80%, Saturation (run-queue) 4, Errors 0

Tooling

Prometheus

The de-facto open-source metrics database. Pulls (scrapes) metrics from targets and stores time series; queried with PromQL.

scrape_configs:
  - job_name: api
    static_configs:
      - targets: ["api:9090"]

PromQL

Prometheus Query Language — selects and aggregates time series for dashboards and alerts.

sum(rate(http_requests_total{status=~"5.."}[5m])) by (route)

Grafana

The dashboarding and visualisation layer. Plugs into Prometheus, Loki, and many other data sources; also handles alerting.

Panel query: rate(http_requests_total[5m])

OpenTelemetry (OTel)

The vendor-neutral standard (and SDKs) for generating and exporting metrics, logs and traces. The Collector reshapes and ships them anywhere.

otel-collector → exporters: [otlp, prometheus, loki]

Jaeger

An open-source distributed tracing backend — stores and visualises traces so you can see span timing across services.

OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317

Loki

Grafana Labs' log aggregation system. It indexes only labels, not the full log text, which keeps it cheap to run. Queried with LogQL.

{app="api"} |= "error" | json | latency_ms > 500

Datadog

A commercial all-in-one observability SaaS — metrics, logs, traces (APM), and dashboards behind one agent and UI.

DD_API_KEY=... datadog-agent run

💡 Convenient but billed largely on custom metrics and ingested log/trace volume — watch cardinality.

English phrases engineers use

  • "We've burned through the error budget this month — freeze the risky deploys."
  • "Don't look at the average; watch the p99, that's where users feel it."
  • "That metric has way too much cardinality — pull user_id out of the labels."
  • "Follow the trace — I want to see which span ate the latency."
  • "What's our MTTR on this? We detected it fast but recovery dragged."
  • "Page on a high burn rate, not on a single 5xx."
  • "Add structured logging with a trace_id so logs link to traces."