Trace: the end-to-end view of one request. When a user clicks "Buy", a trace records the work done in each service, for example API gateway (10ms), Order service (50ms), Inventory service (30ms), Payment service (200ms), Notification service (20ms), together with their parent-child relationships, timing, and outcomes. Components: Trace: the full set of spans for one request, identified by a globally unique 16-byte trace ID. Span: a named, timed operation with a span ID (8 bytes), parent span ID, start time, duration, status (OK/ERROR), and key-value attributes. Root span: the first span, which has no parent. Through their parent IDs, spans form a tree that shows the call hierarchy (a DAG once span links are involved). OpenTelemetry: the CNCF project that has become the de facto standard for trace instrumentation. Its SDKs and instrumentation libraries can auto-instrument HTTP, gRPC, and DB calls in many frameworks.
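The span model described above can be sketched with plain Python data classes. This is an illustration only, not the OpenTelemetry SDK: the `Span` class, the helper names, the call hierarchy (gateway calls order, order calls the other three in sequence), and the start offsets are all assumptions. The durations are chosen so that each span's self time (its duration minus its children's) matches the per-service figures in the "Buy" example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import secrets


@dataclass
class Span:
    name: str
    trace_id: str                  # 16 bytes, hex-encoded
    span_id: str                   # 8 bytes, hex-encoded
    parent_span_id: Optional[str]  # None marks the root span
    start_ms: int
    duration_ms: int
    status: str = "OK"
    attributes: Dict[str, str] = field(default_factory=dict)


def child_of(parent: Optional[Span], name: str, start_ms: int, duration_ms: int) -> Span:
    """Create a span in the parent's trace, or start a new trace if there is no parent."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(name, trace_id, secrets.token_hex(8), parent_id, start_ms, duration_ms)


def render_tree(spans: List[Span]) -> List[str]:
    """Return one indented line per span, following parent-child links from the root."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)
    lines: List[str] = []

    def walk(span: Span, depth: int) -> None:
        lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms}ms)")
        for child in sorted(children.get(span.span_id, []), key=lambda s: s.start_ms):
            walk(child, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)
    return lines


def self_time_ms(span: Span, spans: List[Span]) -> int:
    """Time spent in the span itself, assuming its children do not overlap."""
    return span.duration_ms - sum(
        s.duration_ms for s in spans if s.parent_span_id == span.span_id
    )


# Hypothetical hierarchy and timings for the "Buy" request.
gateway = child_of(None, "api-gateway", 0, 310)
order = child_of(gateway, "order-service", 5, 300)
inventory = child_of(order, "inventory-service", 25, 30)
payment = child_of(order, "payment-service", 60, 200)
notification = child_of(order, "notification-service", 270, 20)
trace = [gateway, order, inventory, payment, notification]

print("\n".join(render_tree(trace)))
```

In this sketch the payment span stands out as the largest child of the order span, which is the kind of question a trace view is meant to answer at a glance.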
2 / 5
What is context propagation in distributed tracing?
Context propagation: the mechanism that ties the spans produced by different services into one trace. Each service passes the trace ID and its current span ID to the services it calls, usually in request headers. Standards: W3C Trace Context (a W3C Recommendation): the traceparent header, e.g. traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01, has the format version-traceID-parentSpanID-flags. It is now the recommended standard. B3 (Zipkin): X-B3-TraceId, X-B3-SpanId, X-B3-Sampled headers. Still widely used. Application responsibility: instrumented HTTP frameworks inject and extract the headers automatically. For async transports (Kafka messages, SQS), embed the trace context in message headers or attributes. The critical failure mode: a service receives traceparent, processes the request, and makes downstream calls, but does not forward the header. The trace breaks: the downstream spans start under a new trace ID and show up as a separate, unrelated trace. Root cause: missing header propagation in one service.
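A minimal sketch of the extract-and-inject step, using only the standard library. The function names `parse_traceparent` and `outgoing_headers` are hypothetical, header keys are assumed to be lowercased already, and the parser is simplified to the four-field layout shown above. In practice an instrumented framework or the OpenTelemetry propagators would do this for you.

```python
import re
import secrets
from typing import Dict, Optional

# version - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[dict]:
    """Extract the trace context from a traceparent value, or None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Build headers for a downstream call: same trace ID, this service's new span ID."""
    ctx = parse_traceparent(incoming.get("traceparent", ""))
    if ctx is None:
        # No usable context arrived, so a new trace starts here.
        # This is also what happens downstream when a service drops the header.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, flags = ctx["trace_id"], ctx["flags"]
    span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
forwarded = outgoing_headers(incoming)
broken = outgoing_headers({})  # the header was not forwarded
```

Here `forwarded` keeps the original trace ID with a fresh span ID, while `broken` carries a different trace ID, which is the failure mode described above.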
3 / 5
What is sampling in distributed tracing and why is it necessary?
Sampling: keeping only a subset of traces. At 10,000 RPS, recording every trace produces 10,000 × 86,400 = 864M traces/day, which is usually too expensive to store and process. Sampling strategies: Head-based (probabilistic): the decision is made at request entry, e.g. 1% or 5% of requests are traced. Simple, but it misses most rare errors. Tail-based: collect all spans temporarily; after the request completes, decide whether to keep the trace based on its outcome (error, slow). Can keep every error trace even at a 0.1% base rate. More complex, because all spans of a trace must be buffered until the decision is made. Rate limiting: keep N traces/second regardless of load. Adaptive: lower the sampling rate under high load. OpenTelemetry sampler config: ParentBased (respect the upstream sampling decision), TraceIdRatioBased (probabilistic), AlwaysOn/AlwaysOff. Honeycomb, Lightstep: support tail-based sampling. Jaeger: configurable sampling per service.
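The arithmetic and the two main strategies can be sketched as follows. The function names, the 1,000 ms slowness threshold, and the dict-shaped spans are assumptions for illustration. The head sampler derives its decision from the trace ID, in the spirit of a ratio-based sampler, but it is not the exact algorithm of any particular SDK.

```python
from typing import Dict, List


def traces_per_day(rps: int) -> int:
    """One trace per request, 86,400 seconds per day."""
    return rps * 86_400


def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head-based decision: deterministic in the trace ID, so every service
    that sees the same trace ID reaches the same answer."""
    threshold = int(ratio * (1 << 64))
    # Compare the low 8 bytes of the trace ID against the threshold.
    return int(trace_id_hex[-16:], 16) < threshold


def tail_sample(
    spans: List[Dict],
    trace_id_hex: str,
    slow_ms: int = 1000,       # hypothetical slowness threshold
    base_ratio: float = 0.001  # 0.1% of unremarkable traces are kept
) -> bool:
    """Tail-based decision, made after all spans of the trace were buffered."""
    if any(span["status"] == "ERROR" for span in spans):
        return True
    if max((span["duration_ms"] for span in spans), default=0) >= slow_ms:
        return True
    return head_sample(trace_id_hex, base_ratio)


print(traces_per_day(10_000))  # 864000000

failed = [{"status": "OK", "duration_ms": 40}, {"status": "ERROR", "duration_ms": 15}]
healthy = [{"status": "OK", "duration_ms": 40}]
unlucky_id = "f" * 32  # falls outside any ratio below 100%

print(tail_sample(failed, unlucky_id))   # kept because of the error
print(tail_sample(healthy, unlucky_id))  # dropped
```

The trade-off is visible in the signatures: `head_sample` needs only the trace ID, while `tail_sample` needs every span of the trace, which is why it requires buffering.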
4 / 5
What are span attributes and what should they contain?
Span attributes: key-value pairs that describe the operation a span represents. OpenTelemetry defines semantic conventions for common attributes. HTTP spans: http.method (GET/POST), http.url, http.status_code, http.request_content_length. DB spans: db.system (postgresql), db.statement (the SQL query; careful with PII), db.name. RPC spans: rpc.system (grpc), rpc.service, rpc.method. Newer versions of the conventions rename some of these keys (for example http.request.method and http.response.status_code), so check which version your SDK targets. Custom attributes: user.id, tenant.id, feature_flag.key, order.id. Besides attributes, a span can carry: Events: time-stamped messages within a span (e.g., "cache miss", "retry attempt 2"). Status: OK or ERROR (plus a description). Links: references to other spans (for async and fan-out patterns). High-cardinality attributes (user ID, order ID) make it possible to find the exact trace behind a specific user complaint.
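As a sketch of the PII caution around db.statement, the helper below builds the attributes for a DB span and masks literal values in the SQL before recording it. The function names are hypothetical and the regex-based masking is deliberately naive (it is not a SQL parser), so treat it as an illustration of the idea rather than a production sanitizer.

```python
import re
from typing import Dict

# Quoted string literals, including doubled '' escapes.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
# Standalone numbers; digits inside identifiers such as "table1" are left alone.
_NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")


def sanitize_sql(statement: str) -> str:
    """Replace literal values with '?' so user data does not end up in span attributes."""
    masked = _STRING_LITERAL.sub("?", statement)
    return _NUMBER_LITERAL.sub("?", masked)


def db_span_attributes(system: str, name: str, statement: str, order_id: str) -> Dict[str, str]:
    """Attributes for a DB span: conventional keys plus one custom, high-cardinality key."""
    return {
        "db.system": system,
        "db.name": name,
        "db.statement": sanitize_sql(statement),
        "order.id": order_id,  # custom attribute, useful for finding one specific trace
    }


attrs = db_span_attributes(
    system="postgresql",
    name="shop",
    statement="SELECT * FROM users WHERE email = 'a@example.com' AND id = 42",
    order_id="ord-1001",
)
print(attrs["db.statement"])  # SELECT * FROM users WHERE email = ? AND id = ?
```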
5 / 5
What is the difference between distributed tracing, metrics, and logs in observability (the "three pillars")?
Three pillars (observability signals): Metrics: aggregated numeric measurements over time. Cheap to store (just numbers). Great for: alerting (error rate > 1%), trending, dashboards. Cannot answer "why did this specific request fail?". Examples: request_duration_seconds (histogram), errors_total (counter), active_connections (gauge). Tools: Prometheus/Grafana. Logs: discrete event records with context. Expensive at scale. Great for: debugging specific events, audit trails. On their own, they cannot show causality across services. Examples: JSON logs with trace ID, request ID, user ID. Tools: Elasticsearch/Loki. Traces: the causal flow of a request across services. Can show which downstream call is slow. Poor at showing aggregate patterns, especially when sampled. Best used to debug specific slow or failed requests identified via metrics. The key: use metrics to detect the problem, traces to understand the causal chain, and logs to see the details at each step. OpenTelemetry provides a common way to collect all three.
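One common way to connect logs to traces is to put the trace ID into every structured log line, so a trace found via metrics can be joined to its log entries. The sketch below uses only the standard library; the `JsonFormatter` class, the logger name, and the field names are assumptions, and a real setup would typically read the trace ID from the active span rather than pass it by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed through `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
        })


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout-sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = make_logger(stream)
log.info(
    "payment declined",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "user_id": "u-123"},
)

entry = json.loads(stream.getvalue())
print(entry["trace_id"])
```

With the trace ID present in each entry, a log store can be queried for all lines belonging to one trace.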