OpenTelemetry & Observability
Distributed tracing, metrics, and logging with OpenTelemetry: spans, context propagation, sampling, and the collector.
- trace /treɪs/
End-to-end record of a request's path through distributed services; composed of a root span and child spans linked by a shared trace ID.
"The trace showed the request spent 850ms in the database query and only 20ms in application code — immediately pinpointing the bottleneck."
- span /spæn/
Unit of work within a trace representing a single operation; has a name, start and end timestamps, status, and a set of key-value attributes.
"Every database call creates a child span with attributes for the SQL query and row count so we can identify slow queries in the trace waterfall."
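A trace and its spans can be modeled in a few lines. The following is a minimal, standard-library sketch and not the OpenTelemetry SDK API: the Span class, the start_span helper, the finished list, and the db.row_count attribute name are all hypothetical names chosen for illustration.

```python
import secrets
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str              # shared by every span in the same trace
    span_id: str               # unique per span
    parent_id: Optional[str]   # None marks the root span
    attributes: dict = field(default_factory=dict)
    status: str = "UNSET"
    start_ns: int = 0
    end_ns: int = 0


finished = []  # hypothetical stand-in for an exporter


@contextmanager
def start_span(name, parent=None, attributes=None):
    span = Span(
        name=name,
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
        attributes=dict(attributes or {}),
        start_ns=time.time_ns(),
    )
    try:
        yield span
    except Exception:
        span.status = "ERROR"
        raise
    finally:
        # The end timestamp is recorded whether or not the operation failed.
        span.end_ns = time.time_ns()
        finished.append(span)


with start_span("GET /orders") as root:
    with start_span("SELECT orders", parent=root,
                    attributes={"db.row_count": 3}) as child:
        pass  # the database call would run here
```

The child span reuses the root's trace ID and stores the root's span ID as its parent, which is what lets a backend draw the trace waterfall.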
- context propagation /ˈkɒntekst ˌprɒpəˈɡeɪʃən/
Mechanism for passing trace context across service boundaries via HTTP headers such as traceparent (W3C standard) and baggage.
"Context propagation ensures the span created in the frontend is a child of the root span started in the API gateway, so the full call chain is visible."
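As a rough illustration of what travels in the traceparent header, the sketch below builds, parses, and continues a header in the W3C layout version-traceid-parentid-flags. It uses only the standard library; the function names are hypothetical, and a real service would normally rely on the OpenTelemetry propagators rather than hand-rolled parsing.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace: fresh trace ID and span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the Trace Context spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming):
    """Header to send downstream: same trace ID, new span ID as parent-id."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()  # no usable context: start a new trace
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The trace ID is carried over unchanged while the parent-id field is replaced on every hop, which is how the receiving service knows which span to attach its own span to.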
- baggage /ˈbæɡɪdʒ/
Key-value pairs attached to a trace context and propagated across all service boundaries; useful for tenant IDs, feature flag values, and user segments.
"We propagate tenant_id as baggage so every downstream service can include it in logs and metrics without each service reading the JWT independently."
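On the wire, baggage is a comma-separated list of key=value members with percent-encoded values. The sketch below is a simplified, standard-library illustration with hypothetical helper names; it assumes keys are already valid header tokens and it discards the optional per-member properties that follow a semicolon.

```python
from urllib.parse import quote, unquote


def format_baggage(entries):
    """Serialize a dict into a baggage header value."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in entries.items()
    )


def parse_baggage(header):
    """Parse a baggage header value into a dict, ignoring malformed members."""
    entries = {}
    for member in header.split(","):
        # Drop optional properties such as ";ttl=30" (simplification).
        key_value = member.strip().split(";", 1)[0]
        if "=" not in key_value:
            continue
        key, value = key_value.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


header = format_baggage({"tenant_id": "acme", "segment": "beta users"})
received = parse_baggage(header)
```

A downstream service would read tenant_id from the parsed dict and attach it to its own logs and metrics, instead of decoding the JWT again.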
- counter /ˈkaʊntər/
Monotonically increasing metric instrument; used for measuring totals such as request counts, errors, and bytes sent.
"We increment the http.server.request.count counter on every request and filter by status code attribute to calculate error rates in dashboards."
- gauge /ɡeɪdʒ/
Metric instrument whose value can increase or decrease; used for measuring current state such as queue depth, memory usage, and active connections.
"A gauge tracks the number of active WebSocket connections in real time — it rises as users connect and falls as they disconnect."
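The difference between a counter and a gauge can be sketched with two small classes. This is an illustrative model only, not the OpenTelemetry metrics API; the class and method names are hypothetical, and the metric names are taken from the examples above or invented for the demo.

```python
class Counter:
    """Monotonic total, kept separately for each attribute combination."""

    def __init__(self, name):
        self.name = name
        self._totals = {}

    def add(self, amount=1, **attributes):
        if amount < 0:
            raise ValueError("a counter can only increase")
        key = tuple(sorted(attributes.items()))
        self._totals[key] = self._totals.get(key, 0) + amount

    def value(self, **attributes):
        return self._totals.get(tuple(sorted(attributes.items())), 0)

    def total(self):
        return sum(self._totals.values())


class Gauge:
    """Current value; may go up or down."""

    def __init__(self, name):
        self.name = name
        self._value = 0

    def set(self, value):
        self._value = value

    def value(self):
        return self._value


requests = Counter("http.server.request.count")
requests.add(status_code=200)
requests.add(status_code=200)
requests.add(status_code=500)
error_rate = requests.value(status_code=500) / requests.total()

connections = Gauge("websocket.active_connections")  # hypothetical name
connections.set(3)  # three users connected
connections.set(2)  # one disconnected
```

Keeping one total per attribute set is what makes the "filter by status code" query in the counter example possible.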
- histogram /ˈhɪstəɡræm/
Metric instrument that records observations in configurable buckets and tracks their count and sum; used for request duration and payload size distributions.
"The histogram for API latency shows p50=45ms, p95=230ms, and p99=800ms — the long tail was invisible in the average metric we used previously."
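To make the bucket mechanics concrete, here is a small sketch of an explicit-bucket histogram. The class name, the boundaries, and the sample latencies are made up for illustration; bucket upper bounds are treated as inclusive, and the percentile helper only reports the upper bound of the bucket that contains the requested quantile, so it is an estimate rather than an exact value.

```python
from bisect import bisect_left


class Histogram:
    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        # One bucket per boundary plus a final overflow bucket.
        self.bucket_counts = [0] * (len(self.boundaries) + 1)
        self.count = 0
        self.sum = 0.0

    def record(self, value):
        # bisect_left places a value equal to a boundary in the bucket
        # whose upper bound is that boundary.
        self.bucket_counts[bisect_left(self.boundaries, value)] += 1
        self.count += 1
        self.sum += value

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket holding the q-quantile (None if empty)."""
        if self.count == 0:
            return None
        target = q * self.count
        running = 0
        for index, bucket_count in enumerate(self.bucket_counts):
            running += bucket_count
            if running >= target:
                if index < len(self.boundaries):
                    return self.boundaries[index]
                return float("inf")


latency_ms = Histogram([50, 100, 250, 500, 1000])
for sample in [20, 45, 45, 60, 230, 800]:
    latency_ms.record(sample)

mean = latency_ms.sum / latency_ms.count       # 200.0
p50 = latency_ms.quantile_upper_bound(0.50)    # 50
p99 = latency_ms.quantile_upper_bound(0.99)    # 1000
```

With these samples the mean is 200 ms even though half of the requests finished within 50 ms, which is the kind of long tail an average hides.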
- structured log /ˈstrʌktʃəd lɒɡ/
Log entry emitted as key-value data, typically JSON, with consistent fields such as timestamp, level, service, and trace_id; machine-readable and correlatable with traces.
"Switching to structured JSON logs lets us query by trace_id in our log aggregator and jump directly from a log line to the full distributed trace."
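One way to produce such entries with Python's standard logging module is a custom formatter. The sketch below is illustrative: the JsonFormatter class, the "checkout" service name, and the logger name are assumptions, and the trace ID is passed in by hand through the extra argument rather than read from an active span.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # Present only when the caller supplied it via extra=...
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="checkout"))

logger = logging.getLogger("structured_log_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "payment authorized",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
)
line = stream.getvalue().strip()
```

Because every line is a JSON object with a trace_id field, a log aggregator can index that field and link each entry to its trace.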
- correlation ID /ˌkɒrəˈleɪʃən aɪˈdiː/
Unique identifier propagated through every service in a request chain; used to join logs, traces, and metrics from different services for the same request.
"We log the correlation ID at every layer so support engineers can grep a single ID across five microservices and reconstruct the full request timeline."
- auto-instrumentation /ˌɔːtəʊ ˌɪnstrəmenˈteɪʃən/
SDK agent that automatically captures telemetry from popular libraries (HTTP clients, database drivers) without requiring manual code changes.
"Auto-instrumentation for Express and pg gave us traces for every HTTP request and database query on day one — before we wrote a single manual span."
- OTLP /əʊ tiː el piː/
OpenTelemetry Protocol; the standard wire protocol for exporting traces, metrics, and logs to collectors and observability backends, carried over gRPC or HTTP with Protobuf (or JSON) payloads.
"All our services export via OTLP to the OpenTelemetry Collector, which fans out to Jaeger for traces and Prometheus for metrics."
- OpenTelemetry Collector /ˌəʊpən teˈlɪmetri kəˈlektər/
Vendor-agnostic proxy that receives telemetry via OTLP or other protocols, processes it (filtering, batching, enriching), and exports it to one or more backends.
"The Collector enriches every span with the Kubernetes pod name and namespace before forwarding to our backend — no code changes needed in the services."
- sampling /ˈsɑːmplɪŋ/
Strategy to reduce telemetry volume by recording only a fraction of traces; head sampling decides at trace start, while tail sampling decides after the trace completes, based on its outcome.
"We use tail-based sampling to keep 100% of error traces and 1% of successful traces — errors are rare but always captured in full."
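The two decision points can be sketched as plain functions. This is a simplified illustration under stated assumptions, not the SDK's or the Collector's sampler: the function names and the span dictionaries are hypothetical, and the head sampler simply treats the low 64 bits of the trace ID as a uniform random number.

```python
def head_sample(trace_id_hex, ratio):
    """Decide at trace start, using nothing but the trace ID.

    Every service that applies the same ratio to the same trace ID
    reaches the same decision, so traces are kept or dropped whole.
    """
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound


def tail_sample(spans, success_ratio=0.01):
    """Decide after the trace completes, once its outcome is known."""
    if any(span["status"] == "ERROR" for span in spans):
        return True  # keep every trace that contains an error
    return head_sample(spans[0]["trace_id"], success_ratio)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
ok_trace = [{"trace_id": trace_id, "status": "OK"}]
failed_trace = [
    {"trace_id": trace_id, "status": "OK"},
    {"trace_id": trace_id, "status": "ERROR"},
]

keep_failed = tail_sample(failed_trace)  # True: errors are always kept
keep_ok = tail_sample(ok_trace)          # False for this particular trace ID
```

Tail sampling needs all spans of a trace in one place before it can decide, which is why it is usually done in a collector tier rather than inside each service.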
- exemplar /ɪɡˈzemplər/
Sample measurement attached to a metric data point, such as a histogram bucket, that carries the ID of the trace that produced it; lets you jump from a high-latency bucket to a specific trace you can inspect.
"Clicking the exemplar on the p99 latency bucket opened the exact trace that caused the spike — without exemplars we would have had to search through millions of traces."
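The link from a bucket to a trace can be illustrated by storing a trace ID next to each bucket count. This is a toy model with hypothetical names; it keeps only the most recent traced measurement per bucket, whereas real SDKs and backends apply their own exemplar selection rules.

```python
from bisect import bisect_left


class ExemplarHistogram:
    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        size = len(self.boundaries) + 1  # last slot is the overflow bucket
        self.bucket_counts = [0] * size
        self.exemplars = [None] * size

    def record(self, value, trace_id=None):
        index = bisect_left(self.boundaries, value)
        self.bucket_counts[index] += 1
        if trace_id is not None:
            # Remember which trace produced the latest value in this bucket.
            self.exemplars[index] = {"value": value, "trace_id": trace_id}


latency_ms = ExemplarHistogram([100, 500])
latency_ms.record(45, trace_id="0af7651916cd43dd8448eb211c80319c")
latency_ms.record(800, trace_id="4bf92f3577b34da6a3ce929d0e0e4736")

slowest_bucket = latency_ms.exemplars[-1]
```

Looking up slowest_bucket["trace_id"] in the tracing backend is the programmatic equivalent of clicking the exemplar on a dashboard.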
- semantic convention /sɪˈmæntɪk kənˈvenʃən/
Standard attribute naming rules defined by OpenTelemetry (e.g. http.request.method and db.system.name, which replaced the older http.method and db.system); ensures telemetry from different services and languages is consistent.
"Following semantic conventions means our Grafana dashboards work with traces from both the Python and Go services — they all use the same attribute names."