34 of the words you'll hear in on-call, SRE reviews and incident retros: what each one means, an example or formula, and the gotcha you'll wish someone had told you.
metric
A numeric measurement sampled over time: request rate, CPU, queue depth. Cheap to store and aggregate; great for dashboards and alerts.
http_requests_total{method="GET",status="200"} 48213
log
A timestamped record of a discrete event. High detail, high volume: the thing you grep when an alert fires.
2026-05-29T10:02:11Z ERROR order_id=4821 payment_declined gateway=stripe
trace
The end-to-end story of one request as it crosses services, made of nested spans. Answers "where did the time go?".
trace_id=ab12 → [api 120ms] → [db 80ms] → [cache 5ms]
span
A single timed unit of work within a trace: one operation, with a start, duration, parent, and attributes.
span: "SELECT orders" parent=api duration=80ms db.system=postgres
structured logging
Logging machine-parseable key/value records (usually JSON) instead of free-form text, so logs can be queried and filtered.
{"level":"error","msg":"payment failed","order_id":4821,"latency_ms":250}
💡 Include a trace_id field so logs link straight to the trace.
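A minimal sketch of structured logging with Python's standard `logging` module, assuming a hypothetical logger name (`checkout`) and a fixed set of extra fields (`trace_id`, `order_id`, `latency_ms`). It is one way to emit a JSON record like the one above, not a reference implementation.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    # Hypothetical set of extra fields this sketch knows how to emit.
    EXTRA_FIELDS = ("trace_id", "order_id", "latency_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling `make_logger(sys.stdout).error("payment failed", extra={"trace_id": "ab12", "order_id": 4821, "latency_ms": 250})` writes one JSON line that carries the trace_id alongside the other fields.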
SLI
Service Level Indicator: the actual measured number you care about, e.g. the proportion of requests served under 300ms.
SLI = good_requests / valid_requests = 0.9987
SLO
Service Level Objective: the internal target for an SLI over a window, e.g. 99.9% of requests under 300ms over 30 days.
SLO: 99.9% availability over 30 days
SLA
Service Level Agreement: a contractual promise to customers, with penalties. Usually looser than your internal SLO.
SLA: 99.5% uptime or we credit your account
error budget
The allowed amount of failure: 100% minus the SLO. A 99.9% SLO grants ~43 minutes of downtime per 30 days to spend on risk.
budget = (1 - 0.999) * 30d = 43.2 min/month
💡 When the budget is spent, you freeze risky deploys and focus on reliability.
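The budget arithmetic can be sketched in a few lines of Python. The function names and the 30-day default window are illustrative assumptions, and the SLO is treated as a pure time-based availability target.

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: float = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    return 1 - downtime_minutes / error_budget_minutes(slo, window_days)
```

For example, `error_budget_minutes(0.999)` is about 43.2, and after 10.8 minutes of downtime `budget_remaining(0.999, 10.8)` is about 0.75, so three quarters of the budget is left.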
burn rate
How fast you’re consuming the error budget relative to "on pace". A burn rate of 1 spends it exactly over the window; higher means trouble.
burn_rate = (error_rate) / (1 - SLO)
14.4x over 1h → page now
💡 Multi-window, multi-burn-rate alerts catch both fast outages and slow leaks.
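To see why 14.4x over one hour is worth a page, here is a small sketch of the burn-rate arithmetic. The helper names are hypothetical and the SLO window is assumed to be 30 days.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1 - slo)


def hours_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Hours until a full budget is gone if this burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def budget_consumed(rate: float, alert_window_hours: float, window_days: float = 30) -> float:
    """Fraction of the whole budget spent during the alert window."""
    return rate * alert_window_hours / (window_days * 24)
```

With a 99.9% SLO, a 1.44% error rate gives a burn rate of about 14.4. Sustained for one hour, that spends about 2% of the 30-day budget, and the whole budget would be gone in about 50 hours.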
golden signals
Google SRE’s four signals to watch on any user-facing service: latency, traffic, errors, and saturation.
latency (p99), traffic (req/s), errors (5xx %), saturation (% capacity)
MTTR
Mean Time To Recover/Repair: average time from an incident starting to service being restored. The headline DORA reliability metric.
MTTR = total_downtime / number_of_incidents
MTTD
Mean Time To Detect: average time between a problem starting and someone (or an alert) noticing it.
MTTD = sum(detect_time - start_time) / incidents
MTTA
Mean Time To Acknowledge: average time from an alert firing to an on-call engineer acknowledging it.
MTTA = sum(ack_time - alert_time) / alerts
MTBF
Mean Time Between Failures: average healthy interval between incidents. Higher is better.
MTBF = total_uptime / number_of_failures
uptime / availability (nines)
The fraction of time a service is up, quoted in "nines". Each extra nine cuts allowed downtime by 10x.
99.9% = 43.2 min/month
99.99% = 4.3 min/month
99.999% = 26 s/month
cardinality
The number of unique label combinations a metric has. High cardinality (e.g. user_id as a label) blows up storage and cost.
http_requests{user_id="..."} → millions of series = bad
💡 Keep IDs and free-form values out of metric labels; put them in traces or logs.
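A rough way to estimate the blow-up: the worst-case series count for one metric is the product of the distinct values each label can take. The sketch below uses made-up label counts purely for illustration; real series counts are usually lower, because not every combination occurs.

```python
from math import prod
from typing import Mapping


def max_series(label_values: Mapping[str, int]) -> int:
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label cardinalities for an HTTP request counter.
bounded = {"method": 5, "status": 10, "route": 50}
with_user_id = {**bounded, "user_id": 1_000_000}
```

Here `max_series(bounded)` is 2,500, while `max_series(with_user_id)` is 2,500,000,000: one unbounded label multiplies everything else.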
label / tag
A key/value dimension attached to a metric (Prometheus calls it a "label"; Datadog/StatsD call it a "tag") so you can slice and filter.
http_requests_total{method="POST", route="/checkout"}
sampling
Keeping only a fraction of traces/logs to control cost. Head sampling decides up front; tail sampling decides after seeing the whole trace.
sample 100% of errors, 1% of healthy traces
percentile (p50/p95/p99)
The value below which that share of measurements falls. p99 latency = 99% of requests were faster than this. Averages hide the tail.
p50=80ms p95=240ms p99=900ms
💡 You can’t average percentiles across instances: aggregate from a histogram instead.
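A small sketch of why averaging percentiles misleads, using invented latencies for two instances with uneven traffic. The percentile uses the nearest-rank definition with an integer `p`, and the bucket bounds are assumptions chosen for the example.

```python
def percentile(values, p: int):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def bucket_counts(values, bounds):
    """Cumulative counts per upper bound, in the style of `le` buckets."""
    return [sum(v <= b for v in values) for b in bounds]


def bucket_for_percentile(cumulative, bounds, p: int):
    """Upper bound of the first bucket that contains the p-th percentile."""
    rank = -(-p * cumulative[-1] // 100)
    for count, bound in zip(cumulative, bounds):
        if count >= rank:
            return bound


# Hypothetical latencies in milliseconds.
fast = [80] * 990 + [100] * 10   # busy, healthy instance
slow = [900] * 100               # quiet, struggling instance

avg_of_p99 = (percentile(fast, 99) + percentile(slow, 99)) / 2
true_p99 = percentile(fast + slow, 99)

# Bucket counts, unlike percentiles, can simply be added across instances.
bounds = [100, 300, 1000]
merged = [a + b for a, b in zip(bucket_counts(fast, bounds), bucket_counts(slow, bounds))]
```

Averaging the two per-instance p99 values gives 490 ms, while the p99 of the combined traffic is 900 ms. Summing the bucket counts first and then locating the percentile puts it in the bucket bounded by 1000 ms, which agrees with the true value.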
histogram
A metric that buckets observations (e.g. request durations) so you can compute percentiles and heatmaps server-side.
http_duration_bucket{le="0.3"} 9821
http_duration_bucket{le="1.0"} 9990
gauge
A metric that can go up and down: a snapshot of a value right now (temperature, queue length, memory in use).
queue_depth 47
counter
A metric that only ever increases (or resets to zero on restart). You take its rate() to get a per-second value.
rate(http_requests_total[5m])
instrumentation
The code that emits metrics, logs and traces. Can be manual (you add it) or automatic (an agent/library injects it).
ctx, span := tracer.Start(ctx, "checkout")
defer span.End()
context propagation
Passing trace identifiers (trace_id, span_id) across service boundaries, usually in HTTP headers, so spans link into one trace.
traceparent: 00-ab12...-cd34...-01
RED method
For request-driven services, watch Rate, Errors, and Duration. A simpler, service-focused cousin of the golden signals.
Rate=req/s Errors=fail % Duration=p99 latency
USE method
For resources (CPU, disk, network), watch Utilisation, Saturation, and Errors. Brendan Gregg’s checklist for finding bottlenecks.
CPU: Utilisation 80%, Saturation (run-queue) 4, Errors 0
Prometheus
The de-facto open-source metrics database. Pulls (scrapes) metrics from targets and stores time series; queried with PromQL.
scrape_configs:
  - job_name: api
    static_configs:
      - targets: ["api:9090"]
PromQL
Prometheus Query Language: selects and aggregates time series for dashboards and alerts.
sum(rate(http_requests_total{status=~"5.."}[5m])) by (route)
Grafana
The dashboarding and visualisation layer. Plugs into Prometheus, Loki, and many other data sources; also handles alerting.
Panel query: rate(http_requests_total[5m])
OpenTelemetry (OTEL)
The vendor-neutral standard (and SDKs) for generating and exporting metrics, logs and traces. The Collector reshapes and ships them anywhere.
otel-collector → exporters: [otlp, prometheus, loki]
Jaeger
An open-source distributed tracing backend: stores and visualises traces so you can see span timing across services.
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
Loki
Grafana’s log database: indexes only labels, not full text, making it cheap. Queried with LogQL.
{app="api"} |= "error" | json | latency_ms > 500
Datadog
A commercial all-in-one observability SaaS: metrics, logs, traces (APM), and dashboards behind one agent and UI.
DD_API_KEY=... datadog-agent run
💡 Convenient but billed largely on custom metrics and ingested log/trace volume, so watch cardinality.