Vocabulary for Observability and Monitoring in DevOps

Observability is the practice of understanding the internal state of a system by examining its outputs. As teams move toward distributed systems and microservices, the vocabulary of observability has expanded rapidly. Whether you are configuring dashboards, triaging an incident, or presenting reliability metrics to stakeholders, you need precise English vocabulary to communicate clearly.

The Three Pillars of Observability

The observability community commonly refers to three core data types that together give you visibility into a system’s behaviour.

1. Metrics

Metrics are numerical measurements collected over time. They are the backbone of monitoring dashboards.

Time series — a sequence of data points indexed by time, e.g., request rate per second
Gauge — a metric that represents a value at a specific moment and can go up or down, such as current memory usage
Counter — a metric that only increases (apart from resetting to zero when the process restarts), such as the total number of requests served
Histogram — a metric that tracks the distribution of values by counting observations in buckets, often used for latency
Percentile (P50, P95, P99) — the value at or below which a given percentage of observations fall; a P99 latency of 800 ms means 99% of requests completed in 800 ms or less
Cardinality — the number of unique label combinations in a metric; high cardinality can strain monitoring infrastructure
Scrape interval — how frequently a metrics system polls a target for new data
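To make the difference between these metric types concrete, here is a minimal sketch in plain Python. The Counter and Gauge classes, the latency values, and the nearest-rank percentile method are all assumptions chosen for illustration; real monitoring libraries use their own types and often estimate percentiles from histogram buckets instead.

```python
import math


class Counter:
    """Monotonic count: it only goes up while the process is running."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that can move in either direction."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]

requests_total = Counter()
for _ in latencies_ms:
    requests_total.inc()          # one increment per request served

memory_mb = Gauge()
memory_mb.set(512.0)              # gauges can rise...
memory_mb.set(480.0)              # ...and fall

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(f"requests={requests_total.value} memory={memory_mb.value} MB")
print(f"P50={p50} ms  P95={p95} ms  P99={p99} ms  mean={mean_ms:.1f} ms")
```

With only ten samples, P95 and P99 both land on the slowest request (900 ms), while the median is 13 ms. This is why teams usually report tail percentiles rather than the mean alone: two slow requests pull the mean up to 125.6 ms, a value that describes none of the actual requests.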

2. Logs

Logs are time-stamped records of discrete events within a system.

Structured logging — writing logs as machine-readable key-value pairs (e.g., JSON) rather than plain text strings
Log level — the severity of a log entry: DEBUG, INFO, WARN, ERROR, FATAL
Log aggregation — collecting logs from multiple sources into a central system (e.g., Elasticsearch, Loki)
Log retention — how long log data is kept before deletion or archival
Correlation ID — a unique identifier attached to a request as it flows through multiple services, enabling trace reconstruction from logs
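The sketch below shows what structured logging with a correlation ID might look like using only the Python standard library. The JsonFormatter class, the logger name, the service names, and the field names are assumptions made for this example, not a prescribed schema.

```python
import io
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed through `extra=` become attributes on the record.
            "service": getattr(record, "service", None),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("observability-demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# One ID is generated at the edge and reused by every service on the path.
correlation_id = str(uuid.uuid4())

logger.info("order received",
            extra={"service": "checkout", "correlation_id": correlation_id})
logger.error("payment declined",
             extra={"service": "payment", "correlation_id": correlation_id})
logger.debug("cache lookup details")  # dropped: DEBUG is below the INFO level

entries = [json.loads(line) for line in stream.getvalue().splitlines()]
for entry in entries:
    print(entry["level"], entry["service"], entry["correlation_id"])
```

Because every entry is a JSON object, a log aggregation system can filter on correlation_id to collect all entries for a single request. Note that Python's logging module names two of the levels WARNING and CRITICAL rather than WARN and FATAL.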

3. Traces

Distributed tracing tracks a request as it travels across multiple services.

Trace — the full record of a single request’s journey through the system
Span — a single unit of work within a trace, representing one operation in one service
Parent span / child span — the hierarchical relationship between operations in a trace
Trace context propagation — passing trace identifiers between services so spans can be linked
Sampling — recording only a fraction of traces to manage storage costs while maintaining statistical insight
Flame graph — a visual representation of a trace showing which operations took the most time
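The relationship between a trace, its spans, and parent and child spans can be sketched with a small data structure. Everything here is hypothetical: the Span fields, the service and operation names, and the timings. The self-time calculation also assumes that sibling spans do not overlap in time, which is not always true in real traces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in the same trace
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


# One hypothetical trace: a checkout request that calls two other services.
spans = [
    Span("t1", "a", None, "checkout", "POST /checkout", 0, 500),
    Span("t1", "b", "a", "inventory", "reserve_stock", 20, 420),
    Span("t1", "c", "b", "inventory-db", "SELECT stock", 30, 400),
    Span("t1", "d", "a", "payment", "charge_card", 425, 490),
]


def children_of(span, all_spans):
    return [s for s in all_spans if s.parent_id == span.span_id]


def self_time_ms(span, all_spans):
    """Time spent in the span itself, excluding its direct children.
    Assumes the children run one after another, not concurrently."""
    child_total = sum(c.duration_ms for c in children_of(span, all_spans))
    return span.duration_ms - child_total


def render(span, all_spans, depth=0):
    """Return the span tree as indented text lines, parents before children."""
    lines = [f"{'  ' * depth}{span.service}: {span.operation} "
             f"({span.duration_ms} ms)"]
    for child in children_of(span, all_spans):
        lines.extend(render(child, all_spans, depth + 1))
    return lines


root = next(s for s in spans if s.parent_id is None)
slowest = max(spans, key=lambda s: self_time_ms(s, spans))

print("\n".join(render(root, spans)))
print(f"Most self time: {slowest.service} ({self_time_ms(slowest, spans)} ms)")
```

In this made-up trace the root span lasts 500 ms, but most of that time is spent in the database query two levels down. That is the kind of finding a flame graph or waterfall view makes visible at a glance.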

Key Vocabulary: Reliability and SLOs

SLI (Service Level Indicator) — a specific, measurable metric used to assess service performance, e.g., request success rate
SLO (Service Level Objective) — the target value for an SLI, e.g., “99.9% of requests succeed”
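SLA (Service Level Agreement) — a contractual commitment to external customers, usually with penalties if it is missed; it is based on SLOs and is typically set less strictly than the internal targets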
Error budget — the allowable amount of downtime or errors within an SLO period; if the SLO is 99.9%, the error budget is 0.1%
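Burn rate — how quickly the error budget is being consumed relative to the SLO period; a burn rate of 1 uses up the budget exactly at the end of the period, so a sustained rate well above 1 signals a service health problem
MTTR (Mean Time to Recover; also expanded as Repair, Restore, or Resolve) — the average time to restore service after an incident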
MTBF (Mean Time Between Failures) — the average time between incidents
Availability — the percentage of time a service is operational and accessible
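The arithmetic behind error budgets, burn rate, and MTTR is simple enough to sketch in a few functions. The function names, the 30-day period, and the sample figures are assumptions for illustration; teams define their SLO windows and burn-rate alerts in different ways.

```python
import math


def error_budget_fraction(slo_target):
    """Share of requests (or time) allowed to fail: 0.999 gives about 0.001."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    return 1 - slo_target


def budget_minutes(slo_target, period_days):
    """Error budget expressed as minutes of full downtime in the period."""
    return error_budget_fraction(slo_target) * period_days * 24 * 60


def burn_rate(observed_error_rate, slo_target):
    """1.0 means the budget lasts exactly the whole period; higher is faster."""
    return observed_error_rate / error_budget_fraction(slo_target)


def days_until_exhausted(observed_error_rate, slo_target, period_days):
    """Days a full budget would last if the current burn rate continued."""
    rate = burn_rate(observed_error_rate, slo_target)
    return math.inf if rate == 0 else period_days / rate


def mean_time_to_recover(recovery_minutes):
    """MTTR: the average of the recovery times of past incidents."""
    return sum(recovery_minutes) / len(recovery_minutes)


slo = 0.999          # "99.9% of requests succeed"
period_days = 30     # assumed SLO window

budget = budget_minutes(slo, period_days)
rate = burn_rate(0.005, slo)                       # 0.5% of requests failing
days_left = days_until_exhausted(0.005, slo, period_days)
mttr = mean_time_to_recover([47, 22, 21])          # hypothetical incidents

print(f"Error budget: {budget:.1f} minutes per {period_days} days")
print(f"Burn rate: {rate:.1f}x, budget gone in {days_left:.1f} days")
print(f"MTTR: {mttr:.1f} minutes")
```

A 99.9% SLO over 30 days leaves about 43.2 minutes of budget. If 0.5% of requests fail, the burn rate is about 5, so a full budget would be gone in roughly 6 days instead of 30.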

Key Vocabulary: Alerting and On-Call

Alert — a notification triggered when a metric crosses a defined threshold
Threshold — the value at which an alert fires, e.g., “alert if error rate exceeds 5%”
Alert fatigue — the desensitisation that occurs when engineers receive too many low-quality alerts
False positive — an alert that fires when there is no real problem
False negative — a real problem that does not trigger an alert
On-call rotation — a schedule assigning responsibility for responding to alerts to team members in turn
Escalation policy — the defined process for notifying additional responders if an alert is not acknowledged within a time limit
Runbook — a documented procedure for responding to a specific alert or incident type
Silencing / muting — temporarily suppressing notifications for an alert, for example during planned maintenance
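As a rough illustration of how a threshold and a silence interact, the sketch below checks a series of per-minute error rates against a threshold and skips any minutes covered by a silence. The Silence class, the firing_minutes function, and the sample data are hypothetical; real alerting systems usually add conditions such as requiring the threshold to be exceeded for a sustained period.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Silence:
    """A maintenance window, measured in minutes from the start of the data."""
    start_minute: int
    end_minute: int  # exclusive

    def covers(self, minute):
        return self.start_minute <= minute < self.end_minute


def firing_minutes(error_rates, threshold, silences=()):
    """Return the minutes in which the alert would notify someone:
    the error rate exceeds the threshold and no silence is active."""
    fired = []
    for minute, rate in enumerate(error_rates):
        over_threshold = rate > threshold
        silenced = any(s.covers(minute) for s in silences)
        if over_threshold and not silenced:
            fired.append(minute)
    return fired


# Hypothetical per-minute error rates (fraction of failed requests).
error_rates = [0.01, 0.02, 0.07, 0.08, 0.03, 0.09, 0.06, 0.01]
threshold = 0.05                      # "alert if error rate exceeds 5%"
maintenance = Silence(start_minute=5, end_minute=7)

without_silence = firing_minutes(error_rates, threshold)
with_silence = firing_minutes(error_rates, threshold, [maintenance])

print("Fires at minutes:", without_silence)
print("Fires at minutes (maintenance silenced):", with_silence)
```

Raising the threshold reduces false positives but increases the risk of false negatives. Silences carry a similar risk: a real problem that starts during the maintenance window would not notify anyone.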

Talking About Observability in Team Conversations

Discussing Dashboard Health

“The P99 latency on the checkout service has been trending upward for the past three hours.”
“We’re seeing elevated error rates on the payment gateway — roughly 2.3%, which is above our SLO threshold of 1%.”
“The trace data shows that 80% of the latency is coming from a single downstream call to the inventory service.”
“Our error budget is at 23% remaining for this month. We need to be careful about what we ship this week.”

During an Incident

“I’m seeing a spike in the 5xx rate starting at 14:32 UTC. All three instances in eu-west-1 are affected.”
“The flame graph shows the slow span is in the database query layer — specifically the product search endpoint.”
“Following the correlation ID through the logs shows the request timing out while it waits for the recommendations service.”
“We’ve silenced the secondary alerts so we can focus on the primary incident. Escalation is active.”

In a Post-Incident Review

“MTTR for this incident was 47 minutes — above our 30-minute target. Let’s look at the detection and response timeline.”
“The alert fired correctly, but the runbook didn’t cover this failure mode, which added to the resolution time.”
“We had two false positives this week that created noise during the incident. We’ll tune those thresholds.”

Tools Vocabulary

Observability conversations often reference specific tool categories:

APM (Application Performance Monitoring) — tools such as Datadog, New Relic, and Dynatrace that provide end-to-end performance visibility
TSDB (Time Series Database) — specialised databases for metrics, such as Prometheus or InfluxDB
Observability platform — an integrated suite combining metrics, logs, and traces, e.g., Grafana Stack, Honeycomb, Datadog
OpenTelemetry (OTel) — an open-source, vendor-neutral framework (APIs, SDKs, and tools) for generating, collecting, and exporting observability data across languages and vendors

Understanding and using this vocabulary precisely will make you a more effective participant in reliability discussions, incident response, and architectural reviews. Observability is increasingly a core engineering competency, and the language used to discuss it is part of that competency.
