OpenTelemetry & Observability Vocabulary: The Three Pillars Explained
Learn observability vocabulary — traces, spans, metrics, logs, OpenTelemetry SDK, OTLP, sampling, SLOs, and the three pillars of observability explained for developers.
As systems become more distributed, understanding what is happening inside them becomes harder. Observability is the practice of understanding a system’s internal state from the data it produces. OpenTelemetry (OTel) is the open standard that unifies how applications emit that data. This guide explains the vocabulary you need to participate in observability discussions with your team.
Observability vs Monitoring
Monitoring
Monitoring checks predefined conditions — is the server up? Is CPU above 80%? It tells you that something is wrong based on metrics you decided to track in advance.
“Our monitoring alerts triggered — CPU on the API servers is above 90%.”
Observability
Observability is the broader ability to ask why something is wrong, even questions you didn’t anticipate. It relies on rich telemetry data — traces, metrics, and logs — that lets you explore an unfamiliar failure.
“Monitoring told us there was a problem. Observability helped us understand why the checkout service was slow — we traced it to a downstream payment API.”
The Three Pillars
The three pillars of observability are traces, metrics, and logs. Together they provide a complete picture of system behaviour.
Traces and Spans
Trace
A trace represents the complete journey of a single request as it travels through a distributed system. A trace is made up of multiple spans linked together by a shared trace ID.
“Pull up the trace for this request ID — it will show exactly which service is adding the latency.”
Span
A span represents a single unit of work within a trace — for example, one database query, one HTTP call, or one function execution. Each span records its start time, duration, and contextual metadata (attributes).
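To make the data model concrete, here is a minimal sketch of a trace as a list of span records. It does not use the OpenTelemetry SDK; the Span class, its field names, and the self_time_ms and bottleneck helpers are hypothetical, and the self-time calculation assumes child spans do not overlap.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record; the field names are assumptions."""
    trace_id: str               # shared by every span in the same trace
    span_id: str                # unique to this unit of work
    parent_id: Optional[str]    # None for the root span
    name: str
    start_ms: float             # offset from the start of the trace
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def self_time_ms(span, spans):
    """Duration not accounted for by direct children (assumes no overlap)."""
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - children


def bottleneck(spans):
    """Return the span that spends the most time in its own work."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


trace = [
    Span("t1", "a", None, "GET /checkout", 0, 950),
    Span("t1", "b", "a", "SELECT orders", 20, 800, {"db.system": "postgresql"}),
    Span("t1", "c", "a", "POST /payments", 830, 100),
]
print(bottleneck(trace).name)  # prints "SELECT orders"
```

The root span lasts longest overall, but most of that time belongs to its children, which is why the sketch compares self time rather than raw duration.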
“The trace shows a span for the database query taking 800ms. That’s our bottleneck.”
“Add a custom span around the file-processing logic so we can see how long it takes in production.”
Context Propagation
Context propagation is the mechanism for passing trace context (trace ID, span ID, flags) from one service to the next — typically via HTTP headers (traceparent, tracestate in W3C format). Without it, traces break across service boundaries.
“The trace ends at the edge service and doesn’t continue into the downstream API. Check that context propagation is configured on both sides.”
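As a rough sketch of what propagation involves, the code below parses an incoming W3C traceparent header and builds the header for a downstream call. It handles only version 00 of the format and ignores tracestate; the function names parse_traceparent and outgoing_headers are hypothetical. In practice the OpenTelemetry SDK's propagators do this for you.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return trace_id, span_id, sampled


def outgoing_headers(trace_id, sampled):
    """Headers for a downstream call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)
headers = outgoing_headers(context[0], context[2])
```

If a service drops the header or never reads it, the next service starts a new trace ID, which is exactly the "trace ends at the edge service" symptom.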
Baggage
Baggage is a set of user-defined key-value pairs that travels alongside the trace context and is propagated across service boundaries. It can carry information like userId or tenantId through the entire request journey. Baggage entries are not added to spans automatically; each service has to read them and set span attributes itself.
“We put the tenantId in baggage and copy it onto every span as an attribute, which makes filtering much easier.”
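The sketch below shows one way baggage could be serialized into a header value, parsed on the receiving side, and copied onto span attributes. It follows the general key=value,key=value shape of the W3C baggage header but is not a complete implementation; encode_baggage, decode_baggage, and tag_span are hypothetical helpers.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries):
    """Serialize key-value pairs into a baggage-style header value."""
    return ",".join(f"{key}={quote(str(value), safe='')}" for key, value in entries.items())


def decode_baggage(header):
    """Parse a baggage-style header value, ignoring optional ;properties."""
    entries = {}
    for member in header.split(","):
        if "=" not in member:
            continue
        key, _, value = member.partition("=")
        value = value.split(";", 1)[0]  # drop any properties after ';'
        entries[key.strip()] = unquote(value.strip())
    return entries


def tag_span(span_attributes, baggage, keys=("tenantId",)):
    """Copy selected baggage entries onto a span's attributes."""
    copied = {key: baggage[key] for key in keys if key in baggage}
    return {**span_attributes, **copied}


header = encode_baggage({"tenantId": "acme co", "userId": "42"})
received = decode_baggage(header)
attributes = tag_span({"http.method": "GET"}, received)
```

Because baggage is sent to every downstream service, it is usually kept small and free of sensitive data.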
Metrics
Metric
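A metric is a numeric measurement collected over time. The three most commonly used metric instruments in OpenTelemetry are listed below (the specification also defines others, such as UpDownCounter and asynchronous variants):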
- Counter — a value that only increases (e.g., total requests, total errors).
- Gauge — a value that can go up or down (e.g., current memory usage, active connections).
- Histogram — records the distribution of values (e.g., request latency percentiles).
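The three types differ mainly in what operations they allow. The classes below are a simplified illustration, not the OpenTelemetry metrics API: the class and method names are hypothetical, and the histogram reports only the upper bound of the bucket that contains a given quantile.

```python
import bisect


class Counter:
    """Monotonic: the value only ever increases."""

    def __init__(self):
        self.value = 0

    def add(self, amount=1):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """Holds the latest observed value, which may go up or down."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket; each boundary is an inclusive upper bound."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        self.counts = [0] * (len(self.boundaries) + 1)  # last bucket is overflow

    def record(self, value):
        self.counts[bisect.bisect_left(self.boundaries, value)] += 1

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket that contains the q-th quantile."""
        total = sum(self.counts)
        if total == 0:
            return None
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if running >= q * total:
                if index < len(self.boundaries):
                    return self.boundaries[index]
                return float("inf")


failed_logins = Counter()
failed_logins.add()

latency_ms = Histogram([100, 250, 500, 1000])
for sample in [40, 80, 90, 120, 130, 200, 240, 300, 450, 900]:
    latency_ms.record(sample)
print(latency_ms.quantile_upper_bound(0.5))   # prints 250
print(latency_ms.quantile_upper_bound(0.95))  # prints 1000
```

Percentiles read from a histogram are estimates whose precision depends on the bucket boundaries, so boundaries are usually chosen around the latencies you care about.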
“Create a counter metric for failed login attempts. We want to alert if it spikes.”
“Use a histogram for request latency so we can see the p50, p95, and p99 in the dashboard.”
Exemplar
An exemplar is a sample data point attached to a metric that links it to a specific trace. It lets you jump from a spike in a histogram directly to the trace that caused it.
“The p99 latency spiked at 2 AM. Click the exemplar to see the trace from that exact moment.”
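One way to picture an exemplar is as a trace ID stored next to a histogram bucket. The sketch below keeps the most recent exemplar per bucket; the class names and the keep-latest policy are assumptions for illustration, not how any particular SDK or backend selects exemplars.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Exemplar:
    value: float
    trace_id: str


class HistogramWithExemplars:
    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        self.counts = [0] * (len(self.boundaries) + 1)
        self.exemplars = [None] * (len(self.boundaries) + 1)

    def record(self, value, trace_id=None):
        index = bisect.bisect_left(self.boundaries, value)
        self.counts[index] += 1
        if trace_id is not None:
            # Keep only the most recent exemplar for this bucket.
            self.exemplars[index] = Exemplar(value, trace_id)


latency_ms = HistogramWithExemplars([100, 500, 1000])
latency_ms.record(42, trace_id="trace-fast")
latency_ms.record(1800, trace_id="trace-slow")

# The overflow bucket points straight at the trace behind the spike.
spike = latency_ms.exemplars[-1]
```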
Logs
Log Correlation
Log correlation means attaching trace IDs and span IDs to log entries so you can jump between logs and the corresponding trace. This links logs to traces, in the same way that exemplars link metrics to traces.
“Add trace context to your logs: include traceId and spanId in every log line so we can find the relevant logs from a trace.”
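With Python's standard logging module, one way to do this is a filter that stamps each record with the active trace context. In this sketch the current_trace context variable is a hypothetical stand-in for whatever the tracing SDK exposes, and the log format is only an example.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active (trace ID, span ID); a real SDK manages this.
current_trace = contextvars.ContextVar("current_trace", default=("-", "-"))


class TraceContextFilter(logging.Filter):
    """Attach traceId and spanId to every record that passes through."""

    def filter(self, record):
        record.traceId, record.spanId = current_trace.get()
        return True


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s traceId=%(traceId)s spanId=%(spanId)s %(message)s")
)
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("payment authorised")
```

Every line written through this handler now carries the IDs, so searching the log backend for a trace ID returns the lines produced while that trace was active.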
OpenTelemetry Architecture
OpenTelemetry SDK
The OpenTelemetry SDK is the language-specific library you include in your application to emit telemetry data (traces, metrics, logs). SDKs are available for most languages: Java, Python, Go, Node.js, .NET, and more.
“We’re using the OpenTelemetry Node.js SDK to instrument the API service.”
OpenTelemetry API
The OpenTelemetry API is the interface your application code uses to record spans and metrics. It is separate from the SDK so that libraries can use the API without depending on any specific SDK implementation.
“The library uses the OTel API — it will emit telemetry automatically if the SDK is configured in the application.”
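The split can be illustrated with a no-op default: library code calls a small interface, and nothing is recorded until the application installs a real implementation. All names below (NoOpTracer, RecordingTracer, set_tracer, get_tracer, fetch_user) are hypothetical and much simpler than the actual OpenTelemetry API.

```python
import contextlib


class NoOpTracer:
    """Default used when no SDK is configured: spans cost almost nothing."""

    @contextlib.contextmanager
    def start_span(self, name):
        yield


class RecordingTracer:
    """Stand-in for an SDK implementation that actually records spans."""

    def __init__(self):
        self.finished = []

    @contextlib.contextmanager
    def start_span(self, name):
        try:
            yield
        finally:
            self.finished.append(name)


_tracer = NoOpTracer()


def set_tracer(tracer):
    """Called once by the application, never by libraries."""
    global _tracer
    _tracer = tracer


def get_tracer():
    return _tracer


# Library code depends only on get_tracer(), not on any implementation.
def fetch_user(user_id):
    with get_tracer().start_span("fetch_user"):
        return {"id": user_id}


fetch_user(1)              # no SDK configured: nothing is recorded
sdk = RecordingTracer()
set_tracer(sdk)            # the application opts in
fetch_user(2)              # the same library code now emits a span
```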
OpenTelemetry Collector
The OpenTelemetry Collector is a standalone component that receives, processes, and exports telemetry data. It acts as a pipeline — receiving data from your apps and forwarding it to backends like Jaeger, Prometheus, or Datadog.
“We route all telemetry through the OTel Collector so we can change the backend without touching the application code.”
OTLP (OpenTelemetry Protocol)
OTLP is the standard protocol for sending telemetry data to the OpenTelemetry Collector or a compatible backend. It runs over gRPC (default port 4317) or HTTP (default port 4318).
“Configure the SDK to export to the collector via OTLP on port 4317.”
Instrumentation
Instrumentation is the act of adding observability code to your application. There are two types:
- Auto-instrumentation — an agent or set of instrumentation libraries hooks into popular libraries (HTTP clients, database drivers, etc.) with little or no change to application code.
- Manual instrumentation — you write code to create custom spans and record specific data.
“Auto-instrumentation handles the HTTP and database spans automatically. Add manual instrumentation for the business logic that matters to us.”
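The difference between the two can be sketched with a single wrapper applied in two ways. The traced decorator, the recorded list, and the HttpClient class are hypothetical; real auto-instrumentation packages patch actual libraries in a similar spirit but with far more care.

```python
import functools
import time

recorded = []  # stand-in for an exporter: (span name, duration in seconds)


def traced(name):
    """Wrap a function so each call records a span-like entry."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                recorded.append((name, time.perf_counter() - start))
        return wrapper
    return decorator


# Manual instrumentation: the application author marks the business logic.
@traced("process_file")
def process_file(path):
    return path.upper()


# A third-party class the application does not own.
class HttpClient:
    def get(self, url):
        return 200


# Auto-instrumentation: the method is wrapped from outside, at startup,
# without editing HttpClient or the code that calls it.
HttpClient.get = traced("HTTP GET")(HttpClient.get)

process_file("report.csv")
HttpClient().get("https://example.test/health")
```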
Sampling
Head-Based Sampling
Head-based sampling makes the sampling decision at the start of a trace (when the first span is created). It is simple and low-overhead, but the decision is made before you know whether the trace is interesting.
“We use head-based sampling at 10% — we’re dropping 90% of traces, which is fine for normal traffic.”
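A common way to implement head-based sampling is to derive the decision from the trace ID itself, so every service that sees the same trace reaches the same verdict without coordination. The sketch below illustrates that idea; the should_sample function and its use of the low 64 bits are assumptions, and the ratio-based samplers shipped with OpenTelemetry SDKs may differ in detail.

```python
import random


def should_sample(trace_id, ratio):
    """Keep the trace if the low 64 bits of its ID fall below ratio * 2**64."""
    value = int(trace_id[-16:], 16)
    return value < ratio * 2**64


# Simulate 10,000 random 128-bit trace IDs with a fixed seed.
rng = random.Random(42)
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]

kept = sum(should_sample(trace_id, 0.10) for trace_id in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

The decision needs nothing but the trace ID, which is why it is cheap, and also why it cannot take errors or latency into account.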
Tail-Based Sampling
Tail-based sampling makes the sampling decision at the end of a trace, after all spans have been collected. This lets you keep 100% of slow or error traces while sampling normal ones.
“Configure tail-based sampling in the Collector to always keep traces with errors or latency above 1 second.”
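The policy in the quote can be sketched as a function that runs once a trace is complete. This is an illustration of the decision logic only: the span dictionaries, their keys, and the keep_trace function are hypothetical, and in the Collector this behaviour is set up through configuration rather than code.

```python
import random


def keep_trace(spans, latency_threshold_ms=1000, baseline_ratio=0.1, rng=random.random):
    """Decide whether to keep a trace after all of its spans have arrived."""
    # Always keep traces that contain an error.
    if any(span.get("error") for span in spans):
        return True
    # Always keep traces slower than the threshold.
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration > latency_threshold_ms:
        return True
    # Otherwise keep only a fraction of normal traffic.
    return rng() < baseline_ratio


error_trace = [{"start_ms": 0, "end_ms": 120, "error": True}]
slow_trace = [{"start_ms": 0, "end_ms": 400}, {"start_ms": 350, "end_ms": 1500}]
normal_trace = [{"start_ms": 0, "end_ms": 80}]
```

The cost is that every span of a trace has to be buffered until the decision is made, which is why this usually happens in the Collector rather than in the application.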
SLOs and Alerting
SLO (Service Level Objective)
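An SLO is an internal target for service reliability, for example “99.9% of requests should complete in under 200ms.” SLOs are usually set stricter than any external SLA (Service Level Agreement), so that an internal miss shows up before a contractual breach. The gap between the SLO and 100% is the error budget, and how fast it is being consumed drives alerting.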
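The arithmetic behind error budgets and burn-rate alerts is short enough to write out. The function names and the 30-day window below are assumptions for illustration; the 99.5% figure matches the example SLO used in this section.

```python
def error_budget(slo):
    """Fraction of requests allowed to fail, e.g. 0.005 for a 99.5% SLO."""
    return 1.0 - slo


def burn_rate(error_rate, slo):
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / error_budget(slo)


def hours_to_exhaustion(burn, window_hours, budget_remaining=1.0):
    """Hours until the remaining budget is gone at the current burn rate."""
    return budget_remaining * window_hours / burn


slo = 0.995
burn = burn_rate(error_rate=0.02, slo=slo)                    # about 4x
hours_left = hours_to_exhaustion(burn, window_hours=30 * 24)  # about 180 hours
```

A burn rate of 1 means the budget lasts exactly the length of the window. An alert such as "we will miss our SLO within the next hour" corresponds to a threshold on the hours-to-exhaustion value.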
“Our SLO is 99.5% availability. We’re burning through our error budget faster than expected this week.”
“Alert when the error rate is high enough that we’ll miss our SLO within the next hour.”
How to Use This in Conversation
In an incident:
“Pull up the trace for the failing requests — the spans will show us exactly where the latency is coming from.”
In architecture review:
“We should add auto-instrumentation to the new service before it goes to production. Otherwise we’ll have no visibility when something goes wrong.”
In planning:
“Let’s set up tail-based sampling — right now we’re missing traces for the rare but critical error cases.”
In a postmortem:
“The logs didn’t have trace IDs, so we couldn’t correlate them with the traces. Let’s fix log correlation before the next incident.”
Observability vocabulary is increasingly expected of all engineers, not just SREs. Understanding these terms will help you build more observable systems and contribute meaningfully to reliability discussions.