Vocabulary for Site Reliability Engineers
Essential SRE vocabulary explained in plain English: SLO, error budget, toil, blameless postmortem, runbook, and more — with usage examples.
Site Reliability Engineering (SRE) has its own distinct vocabulary — a mix of traditional operations terminology, Google-originated concepts, and software engineering language. For non-native English speakers moving into SRE roles or working alongside SRE teams, mastering this vocabulary is essential for both daily communication and career growth.
This guide covers the core SRE vocabulary with clear definitions, usage examples, and notes on how each term is used in practice.
Service Level Concepts
SLI (Service Level Indicator)
An SLI is a specific metric that measures an aspect of a service’s reliability — for example, the proportion of successful HTTP requests, or the fraction of requests served within 200ms.
“Our primary SLI for the checkout service is the success rate of payment requests.”
“We track three SLIs: availability, latency, and error rate.”
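As a minimal sketch, an availability SLI can be expressed as the ratio of good events to total events. The function name and the request counts below are hypothetical, chosen only to illustrate the idea.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that succeeded, between 0.0 and 1.0."""
    if total_requests == 0:
        # Assumption: with no traffic, nothing failed, so report 1.0.
        return 1.0
    return successful_requests / total_requests


# Hypothetical numbers: 999,500 successful payment requests out of 1,000,000.
sli = availability_sli(999_500, 1_000_000)
print(f"Availability SLI: {sli:.4%}")
```

Real monitoring systems usually compute this ratio over a time window rather than from two fixed totals.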
SLO (Service Level Objective)
An SLO is the target value or range for an SLI — the reliability goal your team commits to.
“Our SLO is 99.9% availability, measured over a rolling 30-day window.”
“We missed our SLO for the third consecutive month, which has triggered a reliability review.”
SLA (Service Level Agreement)
An SLA is a formal contract with a customer that defines the expected service level and the consequences of missing it. Unlike an SLO (internal target), an SLA has legal and financial implications.
“Our enterprise customers have an SLA that guarantees 99.95% uptime. Breaching it triggers a service credit.”
Key distinction to remember:
- SLI = what you measure
- SLO = what you target
- SLA = what you promise to customers
Error Budgets
Error Budget
An error budget is the amount of downtime or errors you are permitted to have while still meeting your SLO. If your SLO is 99.9% availability, your error budget is 0.1%, or about 43 minutes of downtime over a 30-day window.
“We’ve consumed 60% of our error budget this month — we should slow down risky deployments.”
“The error budget gives us a data-driven way to decide how much risk to take with new releases.”
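The arithmetic behind an error budget can be sketched in a few lines. This assumes a time-based availability SLO and a 30-day window; the function name is hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


budget = error_budget_minutes(0.999)  # 0.1% of 43,200 minutes
consumed = 0.60 * budget              # 60% of the budget already used
remaining = budget - consumed

print(f"Total budget: {budget:.1f} min")
print(f"Consumed:     {consumed:.1f} min")
print(f"Remaining:    {remaining:.1f} min")
```

With a 99.9% SLO, the total comes to roughly 43.2 minutes, so consuming 60% leaves a little over 17 minutes for the rest of the window.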
Error Budget Burn Rate
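The burn rate describes how quickly you are consuming your error budget relative to the pace that would use up exactly the whole budget by the end of the SLO window. A burn rate of 1 spends the budget exactly on schedule; anything higher exhausts it early.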
“We’re burning our error budget at three times the expected rate — if this continues, we’ll exhaust it in ten days.”
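One common way to express burn rate is the observed error rate divided by the error rate the SLO allows. The sketch below uses that definition and assumes a 30-day window that starts with a full budget; the function names and the 0.3% error rate are hypothetical.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_rate = 1.0 - slo
    return observed_error_rate / allowed_error_rate


def days_until_exhausted(rate: float, window_days: int = 30,
                         budget_remaining: float = 1.0) -> float:
    """Days until the remaining budget fraction is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # Nothing is being consumed.
    return budget_remaining * window_days / rate


# Hypothetical: 0.3% of requests are failing against a 99.9% SLO.
rate = burn_rate(0.003, 0.999)
print(f"Burn rate: {rate:.1f}x")
print(f"Budget exhausted in about {days_until_exhausted(rate):.0f} days")
```

A burn rate of about 3 against a 30-day window exhausts a full budget in about ten days, which matches the example sentence above.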
Operational Concepts
Toil
Toil is manual, repetitive, automatable operational work that does not add lasting value. Reducing toil is a core SRE principle.
“Manually restarting services after every deployment is pure toil — we need to automate this.”
“The SRE team tracks toil hours each quarter and aims to keep them below 50% of engineering time.”
Runbook
A runbook is a documented procedure for handling a specific operational scenario — typically an incident or routine task.
“When the payment service latency exceeds 2 seconds, follow the runbook in Confluence to diagnose and resolve it.”
“We’re automating parts of the runbook so on-call engineers spend less time on repetitive diagnostics.”
Playbook
A playbook is similar to a runbook but typically covers a broader range of scenarios or a higher-level incident response strategy.
“The security incident playbook defines the escalation path and communication protocol for data breaches.”
Incidents and Reviews
Blameless Postmortem
A blameless postmortem is a structured review of an incident that focuses on systemic causes rather than individual mistakes. The goal is to learn and improve, not to assign blame.
“After the outage, we ran a blameless postmortem and identified three process gaps that contributed to the incident.”
“Blameless culture is essential — people won’t report near-misses if they fear punishment.”
Mean Time to Recovery (MTTR)
MTTR is the average time it takes to restore service after an incident.
“Our MTTR for P1 incidents is currently 47 minutes — we’re targeting sub-30 minutes by end of quarter.”
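As a rough illustration, MTTR can be computed by averaging the time from the start of each incident to its resolution. The function name and the incident timestamps below are invented for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - started) across all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical P1 incidents as (started, resolved) pairs.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),     # 30 minutes
    (datetime(2024, 2, 12, 22, 15), datetime(2024, 2, 12, 23, 2)),   # 47 minutes
    (datetime(2024, 3, 3, 4, 0), datetime(2024, 3, 3, 5, 4)),        # 64 minutes
]

mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")
```

Teams differ on when the clock starts (first failure, first alert, or first acknowledgement), so it is worth stating which convention a reported MTTR uses.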
Mean Time Between Failures (MTBF)
MTBF is the average time between incidents. A higher MTBF indicates greater reliability.
“Increasing MTBF requires eliminating recurring failure patterns, not just faster recovery.”
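A simple way to estimate MTBF is to average the gaps between consecutive failures. The sketch below measures from one failure start to the next, which is an assumption; some teams measure from recovery to the next failure instead. The function name and dates are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times: list[datetime]) -> timedelta:
    """Average gap between consecutive failure start times."""
    if len(failure_times) < 2:
        raise ValueError("at least two failures are required")
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical failure start times: gaps of 10 days and 20 days.
failures = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 11),
    datetime(2024, 1, 31),
]

mtbf = mean_time_between_failures(failures)
print(f"MTBF: {mtbf.days} days")
```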
Architecture and Capacity
Chaos Engineering
Chaos engineering is the practice of deliberately introducing failures into a system to test its resilience.
“We run chaos experiments in staging to verify that our circuit breakers behave correctly under failure conditions.”
Capacity Planning
Capacity planning is forecasting future resource needs based on growth projections and current usage.
“Based on our traffic growth rate, we’ll need to double database capacity by Q4 — capacity planning should start now.”
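A very simple capacity forecast assumes steady compound growth and asks how long current capacity will last. The sketch below makes exactly that assumption; the function name, the 10% monthly growth rate, and the storage figures are hypothetical.

```python
import math


def months_until_capacity_exceeded(current_usage: float, capacity: float,
                                   monthly_growth: float) -> float:
    """Months until usage reaches capacity, assuming constant compound growth."""
    if current_usage <= 0 or monthly_growth <= 0:
        raise ValueError("usage and growth rate must be positive")
    if current_usage >= capacity:
        return 0.0  # Already at or over capacity.
    # Solve current_usage * (1 + g) ** n == capacity for n.
    return math.log(capacity / current_usage) / math.log(1.0 + monthly_growth)


# Hypothetical: 500 GB used of 1,000 GB, growing 10% per month.
months = months_until_capacity_exceeded(500, 1000, 0.10)
print(f"Capacity reached in about {months:.1f} months")
```

Real forecasts usually add headroom and account for seasonality, so a figure like this is a starting point for discussion, not a deadline.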
On-call
Being on-call means being available outside normal hours to respond to incidents.
“I’m on-call this week, and my response-time target for P1 incidents is 15 minutes.”
Key Vocabulary Reference
| Term | Definition |
|---|---|
| SLI | The metric you measure (e.g., success rate) |
| SLO | The target for that metric (e.g., 99.9%) |
| SLA | The customer-facing commitment |
| Error budget | Allowable downtime or errors within the SLO |
| Toil | Manual, repetitive, automatable work |
| Runbook | Step-by-step procedure for a specific scenario |
| Blameless postmortem | Incident review focused on systemic causes |
| MTTR | Average time to recover from an incident |
SRE vocabulary is precise by design — these terms carry specific meanings that enable exact communication about reliability. Learning to use them accurately will help you contribute to SRE discussions, write clearer incident reports, and work more effectively with reliability-focused teams.