Vocabulary for Observability and Monitoring in DevOps
A comprehensive guide to English vocabulary for observability and monitoring in DevOps — metrics, logs, traces, SLOs, alerting, and on-call communication.
Observability is the practice of understanding the internal state of a system by examining its outputs. As teams move toward distributed systems and microservices, the vocabulary of observability has expanded rapidly. Whether you are configuring dashboards, triaging an incident, or presenting reliability metrics to stakeholders, you need precise English vocabulary to communicate clearly.
The Three Pillars of Observability
The observability community commonly refers to three core data types that together give you visibility into a system’s behaviour.
1. Metrics
Metrics are numerical measurements collected over time. They are the backbone of monitoring dashboards.
- Time series — a sequence of data points indexed by time, e.g., request rate per second
- Gauge — a metric that represents a value at a specific moment and can go up or down, such as current memory usage
- Counter — a metric that only ever increases (resetting to zero only when the process restarts), such as the total number of requests served
- Histogram — a metric that tracks the distribution of values, often used for latency
- Percentile (P50, P95, P99) — the value at or below which a given percentage of observations fall; a P99 latency of 800 ms means 99% of requests completed in 800 ms or less
- Cardinality — the number of unique label combinations in a metric; high cardinality can strain monitoring infrastructure
- Scrape interval — how frequently a metrics system polls a target for new data
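To make percentile and cardinality concrete, here is a minimal Python sketch. It assumes the nearest-rank method for percentiles, which is only one of several definitions. Monitoring systems often estimate percentiles from histogram buckets instead, so their numbers may differ. The function names and sample data are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest observation such that
    at least p% of all observations are at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def cardinality(label_sets):
    """Count the unique label combinations across a list of label dicts."""
    return len({tuple(sorted(labels.items())) for labels in label_sets})


# Hypothetical latency observations in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 14, 13]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Hypothetical label sets attached to samples of a single request metric.
samples = [
    {"method": "GET", "status": "200"},
    {"method": "GET", "status": "500"},
    {"method": "POST", "status": "200"},
    {"method": "GET", "status": "200"},  # repeats the first combination
]
series_count = cardinality(samples)

print(f"P50={p50} ms, P99={p99} ms, cardinality={series_count}")
```

The outlier barely moves the P50 but dominates the P99, which is why tail percentiles are usually quoted alongside the median.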
2. Logs
Logs are time-stamped records of discrete events within a system.
- Structured logging — writing logs as machine-readable key-value pairs (e.g., JSON) rather than plain text strings
- Log level — the severity of a log entry: DEBUG, INFO, WARN, ERROR, FATAL
- Log aggregation — collecting logs from multiple sources into a central system (e.g., Elasticsearch, Loki)
- Log retention — how long log data is kept before deletion or archival
- Correlation ID — a unique identifier attached to a request as it flows through multiple services, enabling trace reconstruction from logs
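As an illustration of structured logging with a correlation ID, the sketch below uses only the Python standard library. The `JsonFormatter` class, the `checkout` logger name, and the field names are assumptions made for this example, not a standard schema. Note that Python's own level names are WARNING and CRITICAL rather than WARN and FATAL.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each log record as one JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via the `extra` argument; None if the caller did not pass one.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)


def build_logger(name="checkout"):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()

# One ID per incoming request, reused on every log line for that request.
correlation_id = str(uuid.uuid4())
logger.info("payment authorised", extra={"correlation_id": correlation_id})
logger.error("inventory lookup timed out", extra={"correlation_id": correlation_id})
```

Because every line carries the same `correlation_id`, a log aggregation system can filter on that one field to reconstruct what happened to a single request.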
3. Traces
Distributed tracing tracks a request as it travels across multiple services.
- Trace — the full record of a single request’s journey through the system
- Span — a single unit of work within a trace, representing one operation in one service
- Parent span / child span — the hierarchical relationship between operations in a trace
- Trace context propagation — passing trace identifiers between services so spans can be linked
- Sampling — recording only a fraction of traces to manage storage costs while maintaining statistical insight
- Flame graph — a visualisation that stacks operations by call hierarchy and sizes each bar by time spent; applied to a trace, it shows which spans took the most time
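The relationships between traces, spans, and sampling can be sketched with a small data model. This is a simplified illustration, not the data model of any particular tracing library. The `Span` fields, the helper names, and the timings are all hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work. Every span in a trace shares the same trace_id."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def child_span(parent, name, start_ms, end_ms):
    """Create a child span. Copying trace_id and recording parent.span_id
    is what context propagation carries across a service boundary."""
    return Span(parent.trace_id, uuid.uuid4().hex[:16], parent.span_id,
                name, start_ms, end_ms)


def self_time_ms(span, spans):
    """Time spent in a span outside its direct children
    (assumes the children do not overlap in time)."""
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def should_sample(trace_id, rate):
    """Deterministic head sampling: derive the decision from the trace ID
    so that every service makes the same choice for the same trace."""
    return int(trace_id, 16) % 10_000 < rate * 10_000


root = Span(uuid.uuid4().hex, uuid.uuid4().hex[:16], None, "GET /checkout", 0, 120)
inventory = child_span(root, "inventory.lookup", 10, 106)
payment = child_span(root, "payment.charge", 106, 118)
spans = [root, inventory, payment]

slowest_child = max((s for s in spans if s.parent_id), key=lambda s: s.duration_ms)
share = slowest_child.duration_ms / root.duration_ms
print(f"{slowest_child.name} accounts for {share:.0%} of the request")
```

In this made-up trace the inventory call takes 96 of 120 ms, which is the kind of finding a flame graph makes visible at a glance.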
Key Vocabulary: Reliability and SLOs
- SLI (Service Level Indicator) — a specific, measurable metric used to assess service performance, e.g., request success rate
- SLO (Service Level Objective) — the target value for an SLI, e.g., “99.9% of requests succeed”
- SLA (Service Level Agreement) — a contractual commitment to external customers, usually built on SLOs and carrying consequences such as service credits if it is missed
- Error budget — the allowable amount of downtime or errors within an SLO period; if the SLO is 99.9%, the error budget is 0.1%
- Burn rate — how quickly the error budget is being consumed; a high burn rate signals a service health problem
- MTTR (Mean Time to Recover) — the average time to restore service after an incident; the R is also expanded as Repair, Restore, or Resolve, so confirm which one a team means
- MTBF (Mean Time Between Failures) — the average time between incidents
- Availability — the percentage of time a service is operational and accessible
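The arithmetic behind error budgets and burn rates is simple enough to sketch in a few lines of Python. The function names are hypothetical, and the 30-day period is an assumption; teams choose their own SLO windows. Burn rate is computed here as the observed error rate divided by the error budget, so a value of 1.0 means the budget would be used up exactly at the end of the period.

```python
def error_budget(slo_target):
    """Fraction of requests (or time) allowed to fail, e.g. 0.999 -> 0.001."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target, period_days=30):
    """Downtime the budget permits over the SLO period (assumed 30 days)."""
    return error_budget(slo_target) * period_days * 24 * 60


def burn_rate(observed_error_rate, slo_target):
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_rate / error_budget(slo_target)


def budget_remaining(bad_events, total_events, slo_target):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = error_budget(slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


def mean_time_to_recover(incident_minutes):
    """Average recovery time across a list of incident durations."""
    return sum(incident_minutes) / len(incident_minutes)


# A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
print(f"{allowed_downtime_minutes(0.999):.1f} minutes")

# A 2.3% error rate against a 99% SLO burns the budget at about 2.3x.
print(f"burn rate {burn_rate(0.023, 0.99):.1f}x")

# 770 failed requests out of 1,000,000 leaves about 23% of a 99.9% budget.
print(f"{budget_remaining(770, 1_000_000, 0.999):.0%} of budget remaining")
```

A burn rate above 1.0 sustained over the whole period means the SLO will be missed, which is why burn rate is a common basis for alerting.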
Key Vocabulary: Alerting and On-Call
- Alert — a notification triggered when a metric crosses a defined threshold
- Threshold — the value at which an alert fires, e.g., “alert if error rate exceeds 5%”
- Alert fatigue — the desensitisation that occurs when engineers receive too many low-quality alerts
- False positive — an alert that fires when there is no real problem
- False negative — a real problem that does not trigger an alert
- On-call rotation — a schedule assigning responsibility for responding to alerts to team members in turn
- Escalation policy — the defined process for notifying additional responders if an alert is not acknowledged within a time limit
- Runbook — a documented procedure for responding to a specific alert or incident type
- Silencing / muting — temporarily suppressing notifications for an alert, for example during planned maintenance
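The following Python sketch shows how a threshold, a silence, and an escalation policy might interact. It is a toy model under stated assumptions, not the behaviour of any real alerting tool. The `AlertRule` class, the state names, and the 15-minute acknowledgement timeout are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical rule: fire when a value rises above a threshold."""
    name: str
    threshold: float
    silenced_until: Optional[datetime] = None

    def evaluate(self, value, now):
        """Return 'ok', 'silenced', or 'firing' for one observation."""
        if value <= self.threshold:
            return "ok"
        if self.silenced_until is not None and now < self.silenced_until:
            return "silenced"  # condition is true, but nobody is notified
        return "firing"


def should_escalate(fired_at, acknowledged_at, now,
                    ack_timeout=timedelta(minutes=15)):
    """Escalate if nobody has acknowledged the alert within the timeout
    (the 15-minute default is an assumption for this example)."""
    if acknowledged_at is not None:
        return False
    return now - fired_at >= ack_timeout


def classify(alert_fired, real_problem):
    """Label an outcome using the vocabulary above."""
    if alert_fired and not real_problem:
        return "false positive"
    if real_problem and not alert_fired:
        return "false negative"
    return "correct"


now = datetime(2024, 1, 1, 14, 32)
rule = AlertRule(name="HighErrorRate", threshold=0.05)
print(rule.evaluate(0.08, now))   # above 5%, not silenced

rule.silenced_until = now + timedelta(hours=1)  # planned maintenance window
print(rule.evaluate(0.08, now))   # still above 5%, but silenced

print(should_escalate(fired_at=now, acknowledged_at=None,
                      now=now + timedelta(minutes=20)))
```

Keeping "silenced" as a separate state from "ok" reflects the point that silencing hides notifications without making the underlying condition go away.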
Talking About Observability in Team Conversations
Discussing Dashboard Health
- “The P99 latency on the checkout service has been trending upward for the past three hours.”
- “We’re seeing elevated error rates on the payment gateway — roughly 2.3%, which is above our SLO threshold of 1%.”
- “The trace data shows that 80% of the latency is coming from a single downstream call to the inventory service.”
- “Our error budget is at 23% remaining for this month. We need to be careful about what we ship this week.”
During an Incident
- “I’m seeing a spike in the 5xx rate starting at 14:32 UTC. All three instances in eu-west-1 are affected.”
- “The flame graph shows the slow span is in the database query layer — specifically the product search endpoint.”
- “Following the correlation ID through the logs shows the request is timing out while waiting for the recommendations service.”
- “We’ve silenced the secondary alerts so we can focus on the primary incident. Escalation is active.”
In a Post-Incident Review
- “MTTR for this incident was 47 minutes — above our 30-minute target. Let’s look at the detection and response timeline.”
- “The alert fired correctly, but the runbook didn’t cover this failure mode, which added to the resolution time.”
- “We had two false positives this week that created noise during the incident. We’ll tune those thresholds.”
Tools Vocabulary
Observability conversations often reference specific tool categories:
- APM (Application Performance Monitoring) — tools such as Datadog, New Relic, and Dynatrace that provide end-to-end visibility into application performance
- TSDB (Time Series Database) — specialised databases for metrics, such as Prometheus or InfluxDB
- Observability platform — an integrated suite combining metrics, logs, and traces, e.g., Grafana Stack, Honeycomb, Datadog
- OpenTelemetry (OTel) — an open-source, vendor-neutral set of specifications, APIs, and SDKs for generating, collecting, and exporting telemetry (traces, metrics, and logs) across languages and vendors
Understanding and using this vocabulary precisely will make you a more effective participant in reliability discussions, incident response, and architectural reviews. Observability is increasingly a core engineering competency — and the language to discuss it is part of that competency.