A postmortem document says: "The incident was not detected for 23 minutes because our three pillars of observability — metrics, logs, and traces — were not correlated." What does each pillar tell you and how do they differ?
The three pillars of observability:
1. Metrics — "What is happening right now?" Numerical measurements aggregated over time. Examples: request rate, error rate, CPU usage, p99 latency. • Stored efficiently as time series (timestamp + value + labels) • Alert-friendly: "error rate > 5% for 5 minutes → alert" • Tools: Prometheus, Datadog metrics, CloudWatch metrics • Weakness: no context for why a spike occurred
2. Logs — "What events occurred?" Time-stamped records of discrete events. Can be structured (JSON, key=value) or unstructured (plain text). • Rich context: user ID, request ID, parameters, stack traces • Tools: Elasticsearch + Kibana (ELK), Loki + Grafana, Splunk, Datadog Logs • Weakness: high volume, expensive to store; hard to see the big picture
3. Traces — "What happened across services for this request?" A trace follows a single request end-to-end through all services it called. • Shows latency at each service hop • Reveals cascading failures (service A called B called C — which one was slow?) • Tools: Jaeger, Zipkin, Tempo, Datadog APM, AWS X-Ray
Correlating the pillars — the workflow:
1. Metrics alert: p99 latency jumped to 2s
2. Logs filtered by time window: error logs show "database connection timeout"
3. Trace for a failing request: payment service → inventory service → DB query taking 1.8s
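To make the correlation concrete, here is a minimal sketch (Python, with made-up field and service names) of a structured JSON log line that carries the same trace_id used by the tracing backend, so an alert can be narrowed to log lines and from there to the exact trace:

```python
import json, logging, time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("payment-service")

def log_event(level, message, trace_id, **fields):
    # Emit one JSON object per line so the log pipeline can index every field,
    # including the trace_id that links this event to a trace.
    log.info(json.dumps({"ts": time.time(), "level": level,
                         "message": message, "trace_id": trace_id, **fields}))

log_event("ERROR", "database connection timeout",
          trace_id="4bf92f3577b34da6a3ce929d0e0e4736",   # hypothetical trace ID
          service="payment-service", duration_ms=1800)
```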
Vocabulary: • observability — the ability to understand a system's internal state from external outputs • telemetry — the data emitted by a system (metrics + logs + traces collectively) • cardinality — number of unique label value combinations in a metric • time series — a sequence of values indexed by time • APM (Application Performance Monitoring) — trace-focused monitoring
A data scientist asks: "Why do we use a Histogram metric for HTTP request latency instead of a Gauge?" What are the Prometheus metric types and what is each used for?
Prometheus metric types vocabulary:
Counter A value that only increases (or resets to 0 on restart). • Use for: total HTTP requests, total errors, total bytes transferred • PromQL: rate(http_requests_total[5m]) — requests per second over last 5 minutes • Never use Counter for values that can decrease (use Gauge for that)
Gauge A value that can go up and down arbitrarily. • Use for: current CPU %, memory used, active connections, queue depth, temperature • PromQL: go_goroutines — current goroutine count
Histogram Samples observations and counts them in configurable buckets. Also provides a running sum and count. • Creates multiple time series: one _bucket{le="..."} series per bucket (including le="+Inf"), plus _sum and _count • Use for: request duration, request sizes • PromQL: histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) → p99 latency • Buckets are configured at instrumentation time (e.g., [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]); see the sketch after this list
Summary Similar to Histogram but calculates quantiles on the client (agent) side. More accurate but cannot be aggregated across instances. Prefer Histogram for distributed systems.
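A small sketch of the three main types side by side, assuming a Python service instrumented with the prometheus_client library; the metric names, routes, and bucket boundaries are illustrative, not prescribed by the exercise:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server
import random, time

REQUESTS = Counter("http_requests_total", "Total HTTP requests served",
                   ["method", "route", "status"])          # only ever goes up
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being handled")
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds",
                    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10])

def handle_request():
    IN_FLIGHT.inc()                     # Gauge: current value, goes up and down
    with LATENCY.time():                # Histogram: observe how long the work took
        time.sleep(random.uniform(0.01, 0.2))
    REQUESTS.labels("GET", "/checkout", "200").inc()   # Counter: one more request
    IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)             # Prometheus scrapes http://host:8000/metrics
    while True:
        handle_request()
```

This also answers the data scientist's question: a Gauge would only hold the latest latency value, while the Histogram's bucketed counts let PromQL reconstruct percentiles such as p99 across scrapes and instances.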
Vocabulary: • PromQL — Prometheus Query Language; used for alerting rules and Grafana panels • label — key-value metadata attached to a metric (method, route, status_code) • cardinality — number of unique label value combinations; high cardinality = storage/performance issue • scrape — Prometheus pulling metrics from a target endpoint (/metrics) • alert rule — PromQL expression that fires an alert when true for a duration • recording rule — pre-computed PromQL expression stored as a new metric (for expensive queries)
A Prometheus alert fires: "High cardinality metric detected — user_id label on http_requests_total has 2M unique values." Why is high cardinality a serious problem in observability systems?
Cardinality in observability vocabulary:
Cardinality The number of unique time series generated by a metric. Each unique combination of label values creates a separate time series.
Example: http_requests_total{method, route, status} • 3 methods × 50 routes × 10 status codes = 1,500 time series ✓ manageable
High cardinality example: http_requests_total{user_id} • 2M users = 2M time series × multiple metrics = millions of series ✗ dangerous
What goes wrong: • Prometheus stores each time series in memory (Head TSDB): 2M series × ~3 KB each = ~6 GB RAM just for one metric • Query performance degrades: aggregating over 2M series is slow • WAL (Write-Ahead Log) and storage bloat • In extreme cases: OOM (Out of Memory) crash, dropped scrapes, cascading failures
High-cardinality antipatterns: • User IDs, session tokens, IP addresses, order IDs as labels • Unbounded value sets as labels (e.g., free-text error messages)
Solutions: • Remove high-cardinality labels from Prometheus; store them in logs instead • Use purpose-built high-cardinality tools: Honeycomb, Lightstep, Tempo (traces) • Limit label value set (e.g., bucket user IDs into cohorts)
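A sketch of the last two fixes in practice, assuming a Python service using prometheus_client: raw user IDs are collapsed into a small, fixed cohort label, and the full user_id moves to the logs. The cohort rule and names are hypothetical.

```python
from prometheus_client import Counter

# What the alert complained about: a raw user_id label means one time series
# per user (2M series for a single metric).
# requests_bad = Counter("http_requests_total", "...", ["method", "route", "status", "user_id"])

# Bounded alternative: the label can take only a handful of values,
# no matter how many users exist.
REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "route", "status", "user_cohort"])

def user_cohort(user_id: int) -> str:
    # Hypothetical cohort rule; the point is the fixed, small value set.
    if user_id < 100_000:
        return "early_adopter"
    return "general"

REQUESTS.labels("GET", "/checkout", "200", user_cohort(1_234_567)).inc()
# Keep the raw user_id in the request logs, where per-event context is cheap.
```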
Vocabulary: • time series — a sequence of (timestamp, value) pairs for one unique label set • label — key=value metadata attached to a metric • TSDB — Time Series Database; Prometheus's storage engine • exemplar — a sample attached to a metric data point that links to a trace ID (correlates metrics to traces) • remote write — Prometheus forwarding data to a long-term storage backend (Thanos, Cortex, Mimir)
An engineering team migrates to OpenTelemetry. The PR description says: "Added OTEL auto-instrumentation — traces now show spans with span context propagation across services." What is a trace, a span, and span context propagation?
OpenTelemetry distributed tracing vocabulary:
Trace The complete record of a single request as it flows through a distributed system. One trace = one user action (e.g., a checkout request) observed across all services it touched. Identified by a unique Trace ID.
Span One unit of work within a trace. Examples: "POST /checkout", "SELECT FROM orders", "call inventory-service". A span records: name, start time, duration, status (OK/ERROR), and attributes (key-value metadata). Spans form a tree: a parent span can have child spans (representing downstream calls).
Span context The metadata needed to link a span to its trace: trace ID + span ID + flags (sampled/not sampled).
Context propagation The mechanism of passing span context between services via HTTP headers (or message queue headers). Standard headers: • traceparent: 00-traceId-spanId-flags (W3C standard, used by OpenTelemetry) • X-B3-TraceId / X-B3-SpanId (Zipkin B3 format, legacy)
Without propagation, the trace is broken — each service records an isolated span with no parent.
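A minimal sketch of what propagation looks like with the OpenTelemetry Python API (auto-instrumentation does this automatically for supported HTTP frameworks and DB drivers); the service name, span names, and the http_post callable are assumptions for illustration, and a configured SDK/exporter is presumed:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout-service")

def call_inventory_service(http_post):
    # Caller side: open a span and inject its context into the outgoing headers.
    with tracer.start_as_current_span("call inventory-service"):
        headers = {}
        inject(headers)                       # writes the W3C "traceparent" header
        http_post("http://inventory/reserve", headers=headers)  # hypothetical HTTP client

def handle_reserve(request_headers):
    # Callee side: extract the caller's span context so this span becomes a
    # child in the same trace instead of starting an isolated one.
    ctx = extract(request_headers)
    with tracer.start_as_current_span("POST /reserve", context=ctx):
        ...  # do the work
```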
OpenTelemetry (OTEL) vocabulary: • SDK — the language library for generating traces/metrics/logs • instrumentation — adding OTEL code to emit telemetry • auto-instrumentation — instruments libraries automatically without code changes (HTTP frameworks, DB drivers) • Collector — OTEL infrastructure component that receives, processes, and exports telemetry • exporter — sends telemetry to a backend (Jaeger, Zipkin, Tempo, Datadog) • OTLP (OpenTelemetry Protocol) — the standard protocol for telemetry data transport • baggage — user-defined values propagated across the trace (e.g., user tier, feature flag state)
A reliability engineer says: "Our SLO is 99.9% availability. Last month we had 45 minutes of downtime. We breached our error budget." What do SLI, SLO, and SLA mean, and what is an error budget?
SLI, SLO, SLA, and error budget vocabulary:
SLI (Service Level Indicator) A measured metric that quantifies service reliability. Must be user-centric. Good SLIs: "% of HTTP requests returning 2xx", "% of requests under 200ms", "availability minutes per month" Poor SLIs: CPU usage (doesn't directly reflect user experience)
SLO (Service Level Objective) An internal target for an SLI. A goal, not a contract. Example: "99.9% of requests succeed in any rolling 28-day window" SLOs are owned by the engineering team and used for operational decisions.
SLA (Service Level Agreement) A contractual commitment to customers or stakeholders, with defined consequences for breach (credits, refunds, contract termination). SLAs are typically set below SLOs to give a buffer: SLO = 99.9%, SLA = 99.5%.
Error budget The allowed amount of unreliability within the SLO period. Error budget = 100% - SLO
In this exercise: a 99.9% SLO over a 30-day window allows 0.1% of 43,200 minutes = 43.2 minutes of downtime; 45 minutes of downtime > 43.2-minute budget = SLO breached.
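The arithmetic behind that number, sketched assuming a 30-day window:

```python
# Error-budget arithmetic for a 99.9% availability SLO over 30 days.
slo = 0.999
window_minutes = 30 * 24 * 60                  # 43,200 minutes in a 30-day window
budget_minutes = (1 - slo) * window_minutes    # 43.2 minutes of allowed downtime
downtime_minutes = 45
burn = downtime_minutes / budget_minutes       # ~1.04x: the budget is overspent
print(budget_minutes, round(burn, 2))
```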
Why error budgets matter: • Error budget spent too fast → new feature releases slow down or freeze (stability over features) • Error budget unused → team can take more deployment/infrastructure risk • Error budget = a shared language between product and SRE/platform teams
Vocabulary: • rolling window — SLO measured over the last 28/30 days, not a fixed calendar month • error rate — % of requests that failed • burn rate — speed at which error budget is being consumed; >1x burn rate means you'll miss the SLO • toil — manual, repetitive operational work; SRE teams measure and reduce toil • availability nines — 99.9% = "three nines" = 43 min/month; 99.99% = "four nines" = 4.3 min/month