5 exercises — Practice advanced observability vocabulary in English: OpenTelemetry, distributed tracing, cardinality, SLI/SLO, golden signals, and eBPF observability.
Advanced: eBPF, continuous profiling, flame graph, head sampling, tail sampling
1 / 5
An SRE explains distributed tracing to a developer whose service is slow: "Your service calls Service B, which calls Service C and a database. The user sees 800ms. Without tracing, you don't know where the time goes. With OpenTelemetry distributed tracing, every request gets a trace ID. Each service creates a span — a named, timed unit of work. Spans are linked into a trace by the trace ID and parent span ID propagated in HTTP headers. You can visualize the full trace: your service took 50ms, Service B took 600ms, 400ms of which was the database query." What is a span in distributed tracing, and what information does it capture?
Span: the fundamental unit of a distributed trace. Contains: trace ID (shared across all spans for a request), span ID (unique per span), parent span ID (links to the parent in the tree), operation name, service name, start timestamp, duration, status (OK/Error/Unset), attributes (key-value metadata), events (timestamped log entries within the span), links (references to spans in other traces).
Trace: a directed acyclic graph (DAG) of spans representing the full lifecycle of a request across services.
Root span: the entry point (usually the user-facing service).
Child spans: created for work nested under a parent, including calls to downstream services.
Visualization: Gantt chart, with time on the x-axis and each span drawn as a bar.
Critical path: the chain of dependent spans that determines the total request duration; this is where the bottleneck sits.
OpenTelemetry vocabulary:
OTel (OpenTelemetry): CNCF project; vendor-neutral instrumentation standard. Covers traces, metrics, logs.
OTLP (OpenTelemetry Protocol): gRPC/HTTP protocol for sending telemetry data to a collector/backend.
OTel Collector: agent/gateway that receives, processes (sample, filter, transform), and exports telemetry to multiple backends.
Auto-instrumentation: instrumenting a service without code changes, using agents (Java agent, .NET profiler) or eBPF.
Manual instrumentation: developers add spans explicitly in code.
Span context propagation: passing trace ID + span ID in HTTP headers (the traceparent header defined by the W3C Trace Context standard) so downstream services can link their spans to the parent trace.
In conversation: 'Before distributed tracing, debugging a slow API call across 8 microservices meant guessing where the time went. Now we see the full call tree in Jaeger and go directly to the slow span.'
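To make span context propagation concrete, here is a minimal sketch of building and parsing a traceparent header by hand. It assumes only the version 00 layout of W3C Trace Context (version, trace ID, parent span ID, flags); the names SpanContext, new_root_context, child_of, to_traceparent and from_traceparent are hypothetical helpers for illustration, not part of the OpenTelemetry SDK, which handles all of this for you.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# Version 00 layout: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str   # shared by every span in the trace
    span_id: str    # unique per span
    sampled: bool   # lowest bit of the flags byte


def new_root_context() -> SpanContext:
    """Start a new trace at the entry-point service."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)


def child_of(parent: SpanContext) -> SpanContext:
    """Create a child context: same trace ID, fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_traceparent(ctx: SpanContext) -> str:
    """Serialize the context for an outgoing HTTP request header."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_traceparent(header: str) -> Optional[SpanContext]:
    """Parse an incoming header; return None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


# Service A starts the trace and calls Service B with the header attached.
root = new_root_context()
header = to_traceparent(root)
# Service B parses the header and creates its own span under the same trace.
incoming = from_traceparent(header)
service_b_span = child_of(incoming)
print(service_b_span.trace_id == root.trace_id)  # True
```

The span ID carried in the header becomes the parent span ID of the span created downstream, which is how a backend such as Jaeger can rebuild the call tree.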
2 / 5
A platform engineer explains cardinality to a developer who caused an outage: "You added a user_id label to your HTTP request metrics. That metric now has one series per user. You have 2 million users — you just created 2 million metric series. Prometheus ran out of memory and restarted. This is label explosion. Cardinality is the number of unique values a label can take. High-cardinality labels — user IDs, request IDs, trace IDs, email addresses — must never be used in metrics. That's what logs and traces are for." What is high cardinality in the context of metrics, and why does it cause problems?
Metric series: each unique combination of metric name + label values is a separate time series stored in the database. Example: http_requests_total{service="api", status="200", user_id="u-12345"} is one series.
Cardinality problem: user_id can take 2 million values; status can take 5. Together: 10 million series per service. Prometheus stores each series separately, so memory grows proportionally to series count. High memory usage → OOM → monitoring outage.
Rule: good labels have bounded cardinality (status code: ~10 values, HTTP method: ~5 values, service name: ~100 values). Bad labels: user IDs, session IDs, transaction IDs, request IDs, emails.
Observability signals for high-cardinality data:
Logs: structured logs can include user_id in each log line. This is OK because logs are stored as events, not as separate indexed series.
Traces: span attributes can include user_id. This is OK because traces are stored per-request, not aggregated.
Exemplars: a link from a metric aggregate to a specific trace that contributed to it. Connects low-cardinality metrics to high-cardinality trace details.
Vocabulary:
Label explosion: adding a high-cardinality label, causing the series count to explode.
Metric series: a unique time series identified by metric name + all label values.
Dimensionality: the number of labels on a metric. High dimensionality increases the total series count.
Recording rule: pre-computes expensive aggregations into new metrics, which reduces query time.
In conversation: 'The rule of thumb: if a label value comes from user input or has unbounded unique values, it belongs in logs or traces, not metrics.'
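The series arithmetic above can be sketched in a few lines. The function name series_count is a hypothetical helper, and the product is a worst-case upper bound: a real system only stores the label combinations that have actually been observed.

```python
from math import prod
from typing import Dict


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Bounded labels only: one service, five status codes.
bounded = {"service": 1, "status": 5}
# The same metric after adding a user_id label for 2 million users.
with_user_id = {**bounded, "user_id": 2_000_000}

print(series_count(bounded))       # 5
print(series_count(with_user_id))  # 10000000
```

Because the counts multiply, a single unbounded label dominates everything else on the metric, which is why the fix is to drop the label rather than trim the others.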
3 / 5
An SRE explains the four golden signals and how they relate to SLOs: "Google's Site Reliability Engineering book defines four golden signals to monitor for any service: latency (how long requests take), traffic (how many requests), errors (rate of failed requests), and saturation (how full your system is — CPU, memory, queue depth). These map directly to SLI candidates. Our SLOs are defined in terms of these: 99.9% of requests complete in under 500ms (latency SLO), and 99.95% of requests succeed (error rate SLO)." What is the RED method and how does it differ from the USE method?
RED method (Tom Wilkie): for every service, monitor:
Rate: requests per second.
Errors: error rate (% or count).
Duration: distribution of response times (p50, p95, p99).
User-oriented: measures how the service appears to callers. Best for: microservices, APIs, HTTP services.
USE method (Brendan Gregg): for every resource (CPU, memory, disk, network), monitor:
Utilization: % of time the resource is busy.
Saturation: amount of queued/waiting work.
Errors: error count for that resource.
Resource-oriented: measures infrastructure health. Best for: nodes, databases, queues.
Four golden signals comparison: Latency (= RED Duration), Traffic (= RED Rate), Errors (= RED Errors + USE Errors), Saturation (= USE Saturation).
SLO vocabulary:
SLI (Service Level Indicator): a metric you measure (request success rate).
SLO (Service Level Objective): target for an SLI (99.9% success over 30 days).
Error budget: (1 - SLO) × time period, the allowable amount of failure.
Burn rate: how fast the error budget is being consumed.
Multi-window alerting: alert on burn rate across short (1h) and long (6h) windows to catch both acute and slow-burning issues.
In conversation: 'RED gives you the user experience; USE gives you why. Start with RED for SLOs: users don't care about CPU utilization, they care about latency and errors.'
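The error budget and burn rate definitions can be worked through numerically. This is a minimal sketch: the function names are hypothetical, burn rate is taken as the observed error ratio divided by the ratio the SLO allows, and the per-window thresholds (14.4 for 1h, 6.0 for 6h) are illustrative assumptions rather than values from the text.

```python
from typing import Dict, List


def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed 'bad' minutes in the window: (1 - SLO) x time period."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the budget is being spent.
    A burn rate of 1.0 uses up the budget exactly at the end of the window."""
    return observed_error_ratio / (1.0 - slo)


def breached_windows(
    error_ratio_by_window: Dict[str, float],
    slo: float,
    threshold_by_window: Dict[str, float],
) -> List[str]:
    """Return the windows whose burn rate meets or exceeds their threshold."""
    return [
        window
        for window, ratio in error_ratio_by_window.items()
        if burn_rate(ratio, slo) >= threshold_by_window[window]
    ]


SLO = 0.999  # 99.9% of requests succeed over 30 days
THRESHOLDS = {"1h": 14.4, "6h": 6.0}  # assumed values, for illustration only

print(round(error_budget_minutes(SLO, 30), 1))  # 43.2
print(round(burn_rate(0.01, SLO), 1))           # 10.0

# Acute incident: 2% errors in the last hour, 0.2% over six hours.
print(breached_windows({"1h": 0.02, "6h": 0.002}, SLO, THRESHOLDS))  # ['1h']
# Slow burn: quiet last hour, but 0.7% errors sustained over six hours.
print(breached_windows({"1h": 0.001, "6h": 0.007}, SLO, THRESHOLDS))  # ['6h']
```

The two example calls show why more than one window is useful: the short window reacts to a sharp spike, while the long window catches a sustained problem that never looks dramatic in any single hour.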
4 / 5
A senior engineer explains sampling strategies: "We instrument every request but we don't store every trace — at 10K requests/second, that's 864M traces/day. We sample. Head-based sampling: decide at the root span whether to keep the trace. Fast but you miss rare errors since you decide before you know the outcome. Tail-based sampling: buffer the entire trace, then decide based on the outcome. Found an error? Keep it. Slow request? Keep it. Normal request? Discard with 99% probability. Better data quality, more complex infrastructure." What is tail-based sampling and why is it better for debugging rare errors?
Head-based sampling: decision made at trace start. Types: probabilistic (keep 1% of all traces randomly), rate-limited (keep N traces/second). Problem: errors and slow requests are rare, so a 1% sample might keep mostly normal requests and discard the one error trace you needed.
Tail-based sampling: all spans are buffered (in the OTel Collector or a dedicated component). After the trace is complete, apply rules: always keep if status=error, always keep if duration > 1s, keep 1% of the rest. Infrastructure requirements: spans from all services must be routed to the same collector instance (or group) to reassemble the complete trace. More complex but much better signal quality.
Sampling vocabulary:
Sampling rate: the probability of keeping a trace. 1% = keep 1 in 100.
Adaptive sampling: automatically adjusts sample rate based on volume to stay within budget.
Consistent sampling: the same trace ID always produces the same sampling decision across services, which prevents incomplete traces.
Instrumentation library: language-specific library for adding OTel instrumentation.
SDK: OTel SDK, the language library that manages trace/span creation and export.
Exporter: OTel component that sends telemetry to a backend (Jaeger, Zipkin, Tempo, Datadog).
Continuous profiling vocabulary:
Continuous profiling: always-on CPU profiling in production (Parca, Pyroscope).
Flame graph: visualization of call stack frequency; wide bars = more CPU time.
In conversation: 'Head sampling is simple to implement; tail sampling is what you need once you're debugging real production issues. The traces you want are exactly the ones a random sampler would throw away.'
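The tail-sampling rules described above (keep errors, keep slow traces, keep 1% of the rest) can be sketched as a decision function over a fully buffered trace. Span, keep_fraction and tail_sample are hypothetical names, and hashing the trace ID with SHA-256 is just one assumed way to get a consistent decision; it is not how the OTel Collector implements its samplers.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_fraction(trace_id: str) -> float:
    """Map a trace ID to a stable number in [0, 1). The same ID always
    yields the same value, so every service reaches the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def tail_sample(
    spans: List[Span],
    slow_ms: float = 1000.0,
    baseline_rate: float = 0.01,
) -> bool:
    """Decide whether to keep a completed trace, given all of its spans."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True   # always keep traces that contain an error
    if max(span.duration_ms for span in spans) > slow_ms:
        return True   # always keep slow traces
    # Otherwise keep a consistent baseline sample of normal traffic.
    return keep_fraction(spans[0].trace_id) < baseline_rate


failed = [Span("t1", 12.0, False), Span("t1", 40.0, True)]
slow = [Span("t2", 1500.0, False)]
print(tail_sample(failed))  # True
print(tail_sample(slow))    # True
```

A head-based sampler would have to call keep_fraction alone, before any span has finished, which is exactly why it cannot favor the error and slow cases.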
5 / 5
A platform engineer presents eBPF-based observability: "eBPF — extended Berkeley Packet Filter — lets you run sandboxed programs inside the Linux kernel without modifying the kernel source or loading kernel modules. For observability, this is transformative: we can attach probes to any kernel function, any system call, any network packet — with near-zero overhead. Tools like Cilium Hubble give us network-level observability without any application changes. Pixie profiles every service automatically. No sidecars, no code changes, no restarts." What makes eBPF valuable for observability, and what are its limitations?
eBPF (extended Berkeley Packet Filter): a technology allowing user-defined programs to run inside the Linux kernel safely. The kernel verifier checks each program before loading and rejects code that could crash the kernel or loop forever.
Zero-code observability: instrument any application (Go, Java, Python, Node.js) at the kernel level without modifying code.
Captures: system calls (file I/O, network), function calls in any language (via uprobes), network packets (before/after the network stack), CPU cycles per function (profiling).
eBPF observability tools:
Cilium: Kubernetes CNI plugin + network policy + Hubble observability (service map, network flows, DNS queries), all via eBPF.
Pixie: automatic profiling and tracing for Kubernetes, with no code changes and no sidecars.
Falco: eBPF-based security monitoring (detects unexpected syscalls).
bpftrace: scripting language for writing custom eBPF probes.
Limitations: requires Linux kernel 4.15+ (full feature set: 5.8+). These tools target Linux; they do not run on Windows or on older kernels. Requires elevated privileges (CAP_BPF on kernel 5.8+, CAP_SYS_ADMIN on earlier kernels). Limited observability into encrypted application-layer payloads: network-level probes see metadata, not decrypted content; plaintext is visible only when probing at the application layer, for example with uprobes on the TLS library.
Sidecar vs. eBPF:
Sidecar: container injected alongside your service. Language-agnostic. Requires container restart on injection.
eBPF: no sidecar, no restart, works at kernel level.
In conversation: 'eBPF is the most exciting observability development in years: it decouples instrumentation from deployment. The catch: your team needs kernel expertise to debug eBPF-level issues.'