A postmortem document says: "The incident was not detected for 23 minutes because our three pillars of observability — metrics, logs, and traces — were not correlated." What does each pillar tell you and how do they differ?
The three pillars of observability:
1. Metrics — "What is happening right now?" Numerical measurements aggregated over time. Examples: request rate, error rate, CPU usage, p99 latency. • Stored efficiently as time series (timestamp + value + labels) • Alert-friendly: "error rate > 5% for 5 minutes → alert" • Tools: Prometheus, Datadog metrics, CloudWatch metrics • Weakness: no context for why a spike occurred
2. Logs — "What events occurred?" Time-stamped records of discrete events. Can be structured (JSON, key=value) or unstructured (plain text). • Rich context: user ID, request ID, parameters, stack traces • Tools: Elasticsearch + Kibana (ELK), Loki + Grafana, Splunk, Datadog Logs • Weakness: high volume, expensive to store; hard to see the big picture
3. Traces — "What happened across services for this request?" A trace follows a single request end-to-end through all services it called. • Shows latency at each service hop • Reveals cascading failures (service A called B called C — which one was slow?) • Tools: Jaeger, Zipkin, Tempo, Datadog APM, AWS X-Ray
Correlating the pillars — the workflow:
1. Metrics alert: p99 latency jumped to 2s
2. Logs filtered by time window: error logs show "database connection timeout"
3. Trace for a failing request: payment service → inventory service → DB query taking 1.8s
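To make the correlation concrete, here is a minimal sketch (Python, with made-up field and service names) of a structured JSON log line that carries the same trace_id used by the tracing backend, so an alert can be narrowed to log lines and from there to the exact trace:

```python
import json, logging, time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("payment-service")

def log_event(level, message, trace_id, **fields):
    # Emit one JSON object per line so the log pipeline can index every field,
    # including the trace_id that links this event to a trace.
    log.info(json.dumps({"ts": time.time(), "level": level,
                         "message": message, "trace_id": trace_id, **fields}))

log_event("ERROR", "database connection timeout",
          trace_id="4bf92f3577b34da6a3ce929d0e0e4736",   # hypothetical trace ID
          service="payment-service", duration_ms=1800)
```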
Vocabulary: • observability — the ability to understand a system's internal state from external outputs • telemetry — the data emitted by a system (metrics + logs + traces collectively) • cardinality — number of unique label value combinations in a metric • time series — a sequence of values indexed by time • APM (Application Performance Monitoring) — trace-focused monitoring
A data scientist asks: "Why do we use a Histogram metric for HTTP request latency instead of a Gauge?" What are the Prometheus metric types and what is each used for?
Prometheus metric types vocabulary:
Counter A value that only increases (or resets to 0 on restart). • Use for: total HTTP requests, total errors, total bytes transferred • PromQL: rate(http_requests_total[5m]) — requests per second over last 5 minutes • Never use Counter for values that can decrease (use Gauge for that)
Gauge A value that can go up and down arbitrarily. • Use for: current CPU %, memory used, active connections, queue depth, temperature • PromQL: go_goroutines — current goroutine count
Histogram Samples observations and counts them in configurable buckets. Also provides a running sum and count. • Creates multiple time series: one _bucket{le="..."} series per bucket (including le="+Inf"), plus _sum and _count • Use for: request duration, request sizes • PromQL: histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) → p99 latency • Buckets are configured at instrumentation time (e.g., [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]); see the sketch after this list
Summary Similar to Histogram but calculates quantiles on the client (agent) side. More accurate but cannot be aggregated across instances. Prefer Histogram for distributed systems.
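A small sketch of the three main types side by side, assuming a Python service instrumented with the prometheus_client library; the metric names, routes, and bucket boundaries are illustrative, not prescribed by the exercise:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server
import random, time

REQUESTS = Counter("http_requests_total", "Total HTTP requests served",
                   ["method", "route", "status"])          # only ever goes up
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being handled")
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds",
                    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10])

def handle_request():
    IN_FLIGHT.inc()                     # Gauge: current value, goes up and down
    with LATENCY.time():                # Histogram: observe how long the work took
        time.sleep(random.uniform(0.01, 0.2))
    REQUESTS.labels("GET", "/checkout", "200").inc()   # Counter: one more request
    IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)             # Prometheus scrapes http://host:8000/metrics
    while True:
        handle_request()
```

This also answers the data scientist's question: a Gauge would only hold the latest latency value, while the Histogram's bucketed counts let PromQL reconstruct percentiles such as p99 across scrapes and instances.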
Vocabulary: • PromQL — Prometheus Query Language; used for alerting rules and Grafana panels • label — key-value metadata attached to a metric (method, route, status_code) • cardinality — number of unique label value combinations; high cardinality = storage/performance issue • scrape — Prometheus pulling metrics from a target endpoint (/metrics) • alert rule — PromQL expression that fires an alert when true for a duration • recording rule — pre-computed PromQL expression stored as a new metric (for expensive queries)
A Prometheus alert fires: "High cardinality metric detected — user_id label on http_requests_total has 2M unique values." Why is high cardinality a serious problem in observability systems?
Cardinality in observability vocabulary:
Cardinality The number of unique time series generated by a metric. Each unique combination of label values creates a separate time series.
Example: http_requests_total{method, route, status} • 3 methods × 50 routes × 10 status codes = 1,500 time series ✓ manageable
High cardinality example: http_requests_total{user_id} • 2M users = 2M time series × multiple metrics = millions of series ✗ dangerous
What goes wrong: • Prometheus stores each time series in memory (Head TSDB): 2M series × ~3 KB each = ~6 GB RAM just for one metric • Query performance degrades: aggregating over 2M series is slow • WAL (Write-Ahead Log) and storage bloat • In extreme cases: OOM (Out of Memory) crash, dropped scrapes, cascading failures
High-cardinality antipatterns: • User IDs, session tokens, IP addresses, order IDs as labels • Unbounded value sets as labels (e.g., free-text error messages)
Solutions: • Remove high-cardinality labels from Prometheus; store them in logs instead • Use purpose-built high-cardinality tools: Honeycomb, Lightstep, Tempo (traces) • Limit label value set (e.g., bucket user IDs into cohorts)
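A sketch of the last two fixes in practice, assuming a Python service using prometheus_client: raw user IDs are collapsed into a small, fixed cohort label, and the full user_id moves to the logs. The cohort rule and names are hypothetical.

```python
from prometheus_client import Counter

# What the alert complained about: a raw user_id label means one time series
# per user (2M series for a single metric).
# requests_bad = Counter("http_requests_total", "...", ["method", "route", "status", "user_id"])

# Bounded alternative: the label can take only a handful of values,
# no matter how many users exist.
REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "route", "status", "user_cohort"])

def user_cohort(user_id: int) -> str:
    # Hypothetical cohort rule; the point is the fixed, small value set.
    if user_id < 100_000:
        return "early_adopter"
    return "general"

REQUESTS.labels("GET", "/checkout", "200", user_cohort(1_234_567)).inc()
# Keep the raw user_id in the request logs, where per-event context is cheap.
```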
Vocabulary: • time series — a sequence of (timestamp, value) pairs for one unique label set • label — key=value metadata attached to a metric • TSDB — Time Series Database; Prometheus's storage engine • exemplar — a sample attached to a metric data point that links to a trace ID (correlates metrics to traces) • remote write — Prometheus forwarding data to a long-term storage backend (Thanos, Cortex, Mimir)
An engineering team migrates to OpenTelemetry. The PR description says: "Added OTEL auto-instrumentation — traces now show spans with span context propagation across services." What is a trace, a span, and span context propagation?
OpenTelemetry distributed tracing vocabulary:
Trace The complete record of a single request as it flows through a distributed system. One trace = one user action (e.g., a checkout request) observed across all services it touched. Identified by a unique Trace ID.
Span One unit of work within a trace. Examples: "POST /checkout", "SELECT FROM orders", "call inventory-service". A span records: name, start time, duration, status (OK/ERROR), and attributes (key-value metadata). Spans form a tree: a parent span can have child spans (representing downstream calls).
Span context The metadata needed to link a span to its trace: trace ID + span ID + flags (sampled/not sampled).
Context propagation The mechanism of passing span context between services via HTTP headers (or message queue headers). Standard headers: • traceparent: 00-traceId-spanId-flags (W3C standard, used by OpenTelemetry) • X-B3-TraceId / X-B3-SpanId (Zipkin B3 format, legacy)
Without propagation, the trace is broken — each service records an isolated span with no parent.
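A minimal sketch of what propagation looks like with the OpenTelemetry Python API (auto-instrumentation does this automatically for supported HTTP frameworks and DB drivers); the service name, span names, and the http_post callable are assumptions for illustration, and a configured SDK/exporter is presumed:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout-service")

def call_inventory_service(http_post):
    # Caller side: open a span and inject its context into the outgoing headers.
    with tracer.start_as_current_span("call inventory-service"):
        headers = {}
        inject(headers)                       # writes the W3C "traceparent" header
        http_post("http://inventory/reserve", headers=headers)  # hypothetical HTTP client

def handle_reserve(request_headers):
    # Callee side: extract the caller's span context so this span becomes a
    # child in the same trace instead of starting an isolated one.
    ctx = extract(request_headers)
    with tracer.start_as_current_span("POST /reserve", context=ctx):
        ...  # do the work
```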
OpenTelemetry (OTEL) vocabulary: • SDK — the language library for generating traces/metrics/logs • instrumentation — adding OTEL code to emit telemetry • auto-instrumentation — instruments libraries automatically without code changes (HTTP frameworks, DB drivers) • Collector — OTEL infrastructure component that receives, processes, and exports telemetry • exporter — sends telemetry to a backend (Jaeger, Zipkin, Tempo, Datadog) • OTLP (OpenTelemetry Protocol) — the standard protocol for telemetry data transport • baggage — user-defined values propagated across the trace (e.g., user tier, feature flag state)
A reliability engineer says: "Our SLO is 99.9% availability. Last month we had 45 minutes of downtime. We breached our error budget." What do SLI, SLO, and SLA mean, and what is an error budget?
SLI, SLO, SLA, and error budget vocabulary:
SLI (Service Level Indicator) A measured metric that quantifies service reliability. Must be user-centric. Good SLIs: "% of HTTP requests returning 2xx", "% of requests under 200ms", "availability minutes per month" Poor SLIs: CPU usage (doesn't directly reflect user experience)
SLO (Service Level Objective) An internal target for an SLI. A goal, not a contract. Example: "99.9% of requests succeed in any rolling 28-day window" SLOs are owned by the engineering team and used for operational decisions.
SLA (Service Level Agreement) A contractual commitment to customers or stakeholders, with defined consequences for breach (credits, refunds, contract termination). SLAs are typically set below SLOs to give a buffer: SLO = 99.9%, SLA = 99.5%.
Error budget The allowed amount of unreliability within the SLO period. Error budget = 100% - SLO
In this exercise: a 99.9% SLO over a 30-day window allows 0.1% of 43,200 minutes = 43.2 minutes of downtime; 45 minutes of downtime > 43.2-minute budget = SLO breached.
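The arithmetic behind that number, sketched assuming a 30-day window:

```python
# Error-budget arithmetic for a 99.9% availability SLO over 30 days.
slo = 0.999
window_minutes = 30 * 24 * 60                  # 43,200 minutes in a 30-day window
budget_minutes = (1 - slo) * window_minutes    # 43.2 minutes of allowed downtime
downtime_minutes = 45
burn = downtime_minutes / budget_minutes       # ~1.04x: the budget is overspent
print(budget_minutes, round(burn, 2))
```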
Why error budgets matter: • Error budget spent too fast → new feature releases slow down or freeze (stability over features) • Error budget unused → team can take more deployment/infrastructure risk • Error budget = a shared language between product and SRE/platform teams
Vocabulary: • rolling window — SLO measured over the last 28/30 days, not a fixed calendar month • error rate — % of requests that failed • burn rate — speed at which error budget is being consumed; >1x burn rate means you'll miss the SLO • toil — manual, repetitive operational work; SRE teams measure and reduce toil • availability nines — 99.9% = "three nines" = 43 min/month; 99.99% = "four nines" = 4.3 min/month