Vocabulary for Talking About Observability and Monitoring
Essential English vocabulary for observability and monitoring: metrics, logs, traces, SLOs, alerting, and the phrases engineers use to discuss system health.
Observability is how teams understand what their systems are actually doing in production. The field has a dense vocabulary — metrics, traces, SLOs, percentiles — and using it precisely marks you out as someone who knows operations. This guide covers the essential terms, common phrases, and example sentences for discussing system health in English.
The Three Pillars
| Term | Meaning |
|---|---|
| Metrics | Numerical measurements over time (e.g. requests per second). |
| Logs | Timestamped records of events. |
| Traces | The path of a single request through your services. |
“The metrics show a latency spike, but I need to look at the traces to see which service is slow.”
These three are often called “the three pillars of observability” — a phrase worth knowing.
Metrics Vocabulary
- Latency — how long a request takes.
- Throughput — how many requests per second.
- Error rate — the percentage of failing requests.
- Saturation — how full a resource is (CPU, memory, disk).
- Percentile (p50, p95, p99) — the value below which X% of requests fall.
“Our p99 latency is 800ms, which means 1% of requests take 800ms or longer.”
Percentiles matter because averages hide the worst experiences. Saying “p95” shows you understand that.
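To see why averages hide the worst experiences, here is a minimal sketch that compares the mean with p50, p95 and p99 on made-up data. The latency values are hypothetical, and the `percentile` helper uses the simple nearest-rank method; real monitoring tools may interpolate or estimate percentiles differently.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [100] * 98 + [5000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

With this data the mean is 198ms, which looks healthy, while the p99 is 5000ms. Only the high percentile exposes the two requests that took five seconds.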
SLOs, SLAs and SLIs
These three are easy to confuse, so be precise:
| Term | Meaning |
|---|---|
| SLI | Service Level Indicator — what you measure (e.g. uptime). |
| SLO | Service Level Objective — your internal target (e.g. 99.9%). |
| SLA | Service Level Agreement — a contractual promise to customers. |
| Error budget | How much unreliability you can “spend” before breaching the SLO. |
“We’ve burned through most of our error budget this month, so we should pause risky deploys.”
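The arithmetic behind an error budget is simple enough to sketch. The example below assumes an availability SLO measured in minutes over a fixed window; the function names and the 36 minutes of downtime are hypothetical, and many teams count failed requests rather than minutes.

```python
def error_budget_minutes(slo_percent, window_days):
    """Minutes of allowed unavailability in the window for an availability SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100

def budget_remaining(slo_percent, window_days, downtime_minutes):
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1 - downtime_minutes / budget

budget = error_budget_minutes(99.9, 30)
remaining = budget_remaining(99.9, 30, downtime_minutes=36)

print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")
```

A 99.9% target over 30 days allows about 43.2 minutes of downtime. After 36 minutes of outages roughly 17% of the budget is left, which is the situation the example sentence describes.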
Alerting Vocabulary
- to fire — what an alert does when it triggers (“the alert fired at 3am”).
- to page someone — to urgently notify the on-call engineer, often outside working hours.
- alert fatigue — being overwhelmed by too many alerts.
- a flapping alert — one that triggers and clears repeatedly.
- a noisy alert — one that fires too often to be useful.
“This alert is too noisy — it’s paging us at 3am for a non-issue. Let’s tune the threshold.”
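One common way to quieten a noisy or flapping alert is to require the condition to hold for several consecutive samples before firing. The sketch below illustrates that idea only; `count_firings`, the sample data and the threshold are all hypothetical, and real alerting systems express this with their own configuration.

```python
def count_firings(samples, threshold, min_consecutive=1):
    """Count how many times an alert would fire over a series of samples.

    The alert fires once the value has exceeded `threshold` for
    `min_consecutive` samples in a row, and clears when it drops back.
    """
    firings = 0
    streak = 0
    firing = False
    for value in samples:
        if value > threshold:
            streak += 1
        else:
            streak = 0
            firing = False  # the alert clears
        if streak >= min_consecutive and not firing:
            firing = True
            firings += 1
    return firings

# Hypothetical error rate (%) sampled once a minute: brief blips, then a real problem.
error_rate_percent = [1, 6, 1, 7, 1, 6, 7, 8, 9, 1]

naive = count_firings(error_rate_percent, threshold=5)
tuned = count_firings(error_rate_percent, threshold=5, min_consecutive=3)

print(f"naive alert fired {naive} times, tuned alert fired {tuned} time(s)")
```

The naive alert fires three times and clears twice in quick succession, which is what engineers mean by flapping. The tuned version fires once, for the sustained problem only.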
Describing System Behaviour
| Phrase | Meaning |
|---|---|
| “It’s degraded.” | Working but slow or partial. |
| “It’s flapping.” | Switching between healthy and unhealthy. |
| “We’re seeing elevated error rates.” | More errors than normal. |
| “It’s saturated.” | A resource is at capacity. |
| “There’s a memory leak.” | Memory use grows over time. |
“The service is degraded — it’s up, but response times are double the baseline.”
The word “baseline” (the normal level) is essential: it gives you a reference point for comparing current behaviour with what is usual.
Verbs You’ll Use Constantly
- to instrument code (add observability to it)
- to scrape metrics (collect them)
- to correlate logs and traces
- to drill down into a metric
- to dashboard something (informal: put it on a dashboard)
- to alert on a condition
“We need to instrument the checkout flow so we can trace where the latency is coming from.”
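To make “instrument” concrete, here is a minimal sketch of adding a latency measurement to a function with a decorator. Everything named here (`instrumented`, `recorded_latencies`, `checkout`) is hypothetical; in practice you would normally use a metrics or tracing library rather than an in-memory dictionary.

```python
import functools
import time

# Hypothetical in-memory store: function name -> list of durations in seconds.
recorded_latencies = {}

def instrumented(func):
    """Record how long each call to `func` takes, even if it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            recorded_latencies.setdefault(func.__name__, []).append(elapsed)
    return wrapper

@instrumented
def checkout(cart_total):
    # Stand-in for real checkout work.
    time.sleep(0.01)
    return cart_total

checkout(42)
print(recorded_latencies)
```

The business logic in `checkout` is unchanged; the decorator wraps it and emits a measurement on every call. That separation is typical of what engineers mean when they talk about instrumenting a code path.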
Useful Phrases in an Investigation
- “Let me drill down into the p99 by endpoint.”
- “The error rate started climbing right after the 14:00 deploy.”
- “I can’t correlate these logs without a trace ID — let’s add one.”
- “The dashboard’s showing a clear spike, but the cause isn’t obvious yet.”
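The remark about trace IDs can be illustrated with a small sketch: when every log record carries the same ID for a given request, correlating logs across services becomes a simple grouping step. The log records, field names and `group_by_trace` helper below are hypothetical.

```python
from collections import defaultdict

# Hypothetical structured log records from three services.
logs = [
    {"service": "gateway", "trace_id": "abc", "msg": "request received"},
    {"service": "checkout", "trace_id": "abc", "msg": "charging card"},
    {"service": "gateway", "trace_id": "def", "msg": "request received"},
    {"service": "payments", "trace_id": "abc", "msg": "timeout calling bank"},
    {"service": "payments", "msg": "retrying"},  # no trace ID recorded
]

def group_by_trace(records):
    """Group log records by trace ID; records without one end up under None."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.get("trace_id")].append(record)
    return dict(grouped)

by_trace = group_by_trace(logs)

for record in by_trace["abc"]:
    print(record["service"], "-", record["msg"])
```

The three records for trace `abc` tell the story of one request across the gateway, checkout and payments services. The record stored under `None` cannot be tied to any request, which is exactly the problem the phrase describes.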
Words People Confuse
| Confused | Clarification |
|---|---|
| Monitoring vs observability | Monitoring watches known problems; observability helps explore unknown ones. |
| Logs vs traces | Logs are events; traces follow one request across services. |
| Latency vs throughput | Latency is the time each request takes; throughput is the number of requests handled per unit of time. |
| SLO vs SLA | SLO is your internal goal; SLA is the customer contract. |
A Sentence to Practise
“Our SLI is request latency, our SLO is p95 under 300ms, and we’ve nearly exhausted this quarter’s error budget — so I’d recommend freezing risky changes and focusing on reliability until it recovers.”
Delivering that fluently signals real operational maturity.
Hedging and Uncertainty
In an incident you’re often unsure. English has precise hedges:
- “The metrics suggest a database bottleneck.”
- “It looks like a memory leak, but I haven’t confirmed it.”
- “We’re fairly confident the deploy caused this.”
With this vocabulary you can move fluently through any observability discussion — from describing a degraded service, to drilling into a p99 spike, to debating whether you’ve blown your error budget. Use the example sentences as templates, keep your SLIs, SLOs and SLAs straight, and hedge honestly when you’re still investigating.