Vocabulary for Site Reliability Engineers

Site Reliability Engineering (SRE) has its own distinct vocabulary — a mix of traditional operations terminology, Google-originated concepts, and software engineering language. For non-native English speakers moving into SRE roles or working alongside SRE teams, mastering this vocabulary is essential for both daily communication and career growth.

This guide covers the core SRE vocabulary with clear definitions, usage examples, and notes on how each term is used in practice.

Service Level Concepts

SLI (Service Level Indicator)

An SLI is a specific metric that measures an aspect of a service’s reliability — for example, the proportion of successful HTTP requests, or the fraction of requests served within 200ms.

“Our primary SLI for the checkout service is the success rate of payment requests.”

“We track three SLIs: availability, latency, and error rate.”
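As a rough illustration of the two example SLIs above, the sketch below computes a success-rate SLI and a latency SLI from raw counts. The function names and the choice to report 1.0 when there is no traffic are assumptions made for this example, not an established convention.

```python
def success_rate_sli(successful_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that succeeded (0.0 to 1.0)."""
    if total_requests == 0:
        # Assumption: with no traffic, report a perfect score. Teams differ here.
        return 1.0
    return successful_requests / total_requests


def latency_sli(latencies_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Return the fraction of requests served within threshold_ms."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


if __name__ == "__main__":
    print(success_rate_sli(999, 1000))
    print(latency_sli([120.0, 180.0, 250.0, 90.0]))
```

Both functions return a ratio of good events to total events, which is the shape most SLIs take.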

SLO (Service Level Objective)

An SLO is the target value or range for an SLI — the reliability goal your team commits to.

“Our SLO is 99.9% availability, measured over a rolling 30-day window.”

“We missed our SLO for the third consecutive month, which has triggered a reliability review.”
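One way to picture "measured over a rolling 30-day window" is the minimal sketch below. It assumes availability is tracked as hypothetical daily (good, total) request counts and compares the aggregate over the most recent 30 days against the target.

```python
def meets_slo(daily_counts: list[tuple[int, int]],
              slo_target: float = 0.999,
              window_days: int = 30) -> bool:
    """Check whether the last window_days of traffic meet the SLO target.

    daily_counts holds one (good_requests, total_requests) pair per day,
    oldest first. This data layout is an assumption for illustration.
    """
    recent = daily_counts[-window_days:]
    good = sum(g for g, _ in recent)
    total = sum(t for _, t in recent)
    if total == 0:
        # Assumption: no traffic in the window counts as meeting the SLO.
        return True
    return good / total >= slo_target


if __name__ == "__main__":
    history = [(9_995, 10_000)] * 30  # 99.95% success every day
    print(meets_slo(history))
```

Aggregating counts across the window, rather than averaging daily percentages, keeps low-traffic days from distorting the result.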

SLA (Service Level Agreement)

An SLA is a formal contract with a customer that defines the expected service level and the consequences of missing it. Unlike an SLO, which is an internal target, an SLA has legal and financial implications.

“Our enterprise customers have an SLA that guarantees 99.95% uptime. Breaching it triggers a service credit.”

Key distinction to remember:

SLI = what you measure
SLO = what you target
SLA = what you promise to customers

Error Budgets

Error Budget

An error budget is the amount of downtime or errors you are permitted to have while still meeting your SLO. If your SLO is 99.9% availability, your error budget is 0.1%, which works out to about 43 minutes of downtime over a 30-day window.

“We’ve consumed 60% of our error budget this month — we should slow down risky deployments.”

“The error budget gives us a data-driven way to decide how much risk to take with new releases.”
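The arithmetic behind the 43-minute figure can be sketched as follows, assuming a time-based availability SLO and a 30-day window. The function names are made up for this example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime allowed by an availability SLO, in minutes."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (1.0 - slo_target) * window_minutes


def budget_consumed(downtime_minutes: float, slo_target: float,
                    window_days: int = 30) -> float:
    """Return the fraction of the error budget already used."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))      # about 43.2 minutes
    print(round(budget_consumed(25.92, 0.999), 2))    # about 0.6, i.e. 60% used
```

A stricter SLO shrinks the budget quickly: under the same assumptions, 99.99% leaves roughly 4.3 minutes per 30 days.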

Error Budget Burn Rate

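The burn rate describes how quickly you are consuming your error budget relative to the pace that would use it up exactly at the end of the SLO window. A burn rate of 1 spends the whole budget over the full window; a burn rate of 3 spends it in a third of that time.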

“We’re burning our error budget at three times the expected rate — if this continues, we’ll exhaust it in ten days.”
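The relationship in the quote above can be checked with a small sketch. It assumes a 30-day SLO window and a request-based error rate, and the function names are hypothetical.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Return how many times faster than the allowed pace errors are occurring."""
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def days_until_exhausted(rate: float, window_days: int = 30,
                         budget_remaining: float = 1.0) -> float:
    """Return the days left before the budget runs out at a constant burn rate.

    budget_remaining is the unused fraction of the budget (1.0 means untouched).
    """
    if rate <= 0:
        return float("inf")
    return window_days * budget_remaining / rate


if __name__ == "__main__":
    rate = burn_rate(observed_error_rate=0.003, slo_target=0.999)
    print(round(rate, 2))                    # roughly 3x the allowed pace
    print(days_until_exhausted(3.0))         # 10.0 days for a full 30-day budget
```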

Operational Concepts

Toil

Toil is manual, repetitive, automatable operational work that does not add lasting value. Reducing toil is a core SRE principle.

“Manually restarting services after every deployment is pure toil — we need to automate this.”

“The SRE team tracks toil hours each quarter and aims to keep toil below 50% of engineering time.”
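Tracking toil against a 50% ceiling is simple arithmetic. The sketch below assumes toil and total engineering hours are already recorded per quarter; the names and figures are illustrative only.

```python
def toil_share(toil_hours: float, total_engineering_hours: float) -> float:
    """Return toil as a fraction of total engineering time."""
    if total_engineering_hours <= 0:
        raise ValueError("total_engineering_hours must be positive")
    return toil_hours / total_engineering_hours


def over_toil_limit(toil_hours: float, total_engineering_hours: float,
                    limit: float = 0.5) -> bool:
    """Return True when toil has reached or passed the limit (50% by default)."""
    return toil_share(toil_hours, total_engineering_hours) >= limit


if __name__ == "__main__":
    # Hypothetical quarter: 180 toil hours out of 480 engineering hours.
    print(toil_share(180, 480))
    print(over_toil_limit(180, 480))
```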

Runbook

A runbook is a documented procedure for handling a specific operational scenario — typically an incident or routine task.

“When the payment service latency exceeds 2 seconds, follow the runbook in Confluence to diagnose and resolve it.”

“We’re automating parts of the runbook so on-call engineers spend less time on repetitive diagnostics.”

Playbook

A playbook is similar to a runbook but typically covers a broader range of scenarios or a higher-level incident response strategy.

“The security incident playbook defines the escalation path and communication protocol for data breaches.”

Incidents and Reviews

Blameless Postmortem

A blameless postmortem is a structured review of an incident that focuses on systemic causes rather than individual mistakes. The goal is to learn and improve, not to assign blame.

“After the outage, we ran a blameless postmortem and identified three process gaps that contributed to the incident.”

“Blameless culture is essential — people won’t report near-misses if they fear punishment.”

Mean Time to Recovery (MTTR)

MTTR is the average time it takes to restore service after an incident.

“Our MTTR for P1 incidents is currently 47 minutes — we’re targeting sub-30 minutes by end of quarter.”
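A minimal sketch of the calculation, assuming each incident is recorded as a hypothetical (detected_at, resolved_at) pair of timestamps:

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]]
) -> timedelta:
    """Return the average of (resolved_at - detected_at) across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


if __name__ == "__main__":
    incidents = [
        (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 40)),  # 40 min
        (datetime(2024, 3, 9, 22, 5), datetime(2024, 3, 9, 22, 59)),  # 54 min
    ]
    print(mean_time_to_recovery(incidents))  # 0:47:00
```

Teams differ on whether the clock starts at the first failure, at detection, or at acknowledgement, so it helps to state which timestamp a reported MTTR uses.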

Mean Time Between Failures (MTBF)

MTBF is the average time between incidents. A higher MTBF indicates greater reliability.

“Increasing MTBF requires eliminating recurring failure patterns, not just faster recovery.”
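As a simple illustration, the sketch below averages the gaps between consecutive incident start times. Measuring start to start is a simplifying assumption; some definitions count only the uptime between the end of one failure and the start of the next.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_starts: list[datetime]) -> timedelta:
    """Return the average gap between consecutive failure start times."""
    if len(failure_starts) < 2:
        raise ValueError("need at least two failures to measure a gap")
    ordered = sorted(failure_starts)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


if __name__ == "__main__":
    starts = [
        datetime(2024, 1, 1),
        datetime(2024, 1, 11),  # 10 days after the first failure
        datetime(2024, 1, 31),  # 20 days after the second failure
    ]
    print(mean_time_between_failures(starts))  # 15 days, 0:00:00
```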

Architecture and Capacity

Chaos Engineering

Chaos engineering is the practice of deliberately introducing failures into a system to test its resilience.

“We run chaos experiments in staging to verify that our circuit breakers behave correctly under failure conditions.”
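To make the circuit breaker example concrete, here is a toy experiment that injects failures into a dependency and checks that the breaker stops sending traffic once it opens. Everything in it is a simplified assumption for illustration: the CircuitBreaker class is hand-written rather than taken from a library, and it omits the timeout and half-open recovery that real breakers usually have.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    """Toy breaker that opens after N consecutive failures and stays open."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, func):
        if self.is_open:
            raise CircuitOpenError("circuit is open; call skipped")
        try:
            result = func()
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0  # any success resets the count
        return result


def run_experiment(breaker: CircuitBreaker, attempts: int) -> int:
    """Inject a failure on every call; return how many calls reached the dependency."""
    calls_reaching_dependency = 0

    def failing_dependency():
        nonlocal calls_reaching_dependency
        calls_reaching_dependency += 1
        raise ConnectionError("injected failure")

    for _ in range(attempts):
        try:
            breaker.call(failing_dependency)
        except (CircuitOpenError, ConnectionError):
            pass  # the experiment only counts calls; errors are expected
    return calls_reaching_dependency


if __name__ == "__main__":
    # With a threshold of 3, only the first 3 of 10 attempts hit the dependency.
    print(run_experiment(CircuitBreaker(failure_threshold=3), attempts=10))
```

The experiment states a hypothesis (the breaker shields the failing dependency after three failures) and then verifies it, which is the basic pattern a chaos experiment follows.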

Capacity Planning

Capacity planning is the practice of forecasting future resource needs from current usage and growth projections.

“Based on our traffic growth rate, we’ll need to double database capacity by Q4 — capacity planning should start now.”
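A back-of-the-envelope forecast like the one in the quote can be sketched as below. It assumes steady compound monthly growth, which real traffic rarely follows exactly, and the function name and numbers are hypothetical.

```python
import math


def months_until_capacity(current_load: float, capacity: float,
                          monthly_growth: float) -> float:
    """Return months until load reaches capacity under compound growth.

    monthly_growth is a fraction, e.g. 0.10 for 10% growth per month.
    """
    if current_load <= 0 or capacity <= 0:
        raise ValueError("current_load and capacity must be positive")
    if current_load >= capacity:
        return 0.0
    if monthly_growth <= 0:
        return float("inf")  # load never reaches capacity without growth
    return math.log(capacity / current_load) / math.log(1.0 + monthly_growth)


if __name__ == "__main__":
    # Hypothetical: load must double before it hits capacity, growing 10% a month.
    print(round(months_until_capacity(100.0, 200.0, 0.10), 1))  # about 7.3
```

Because provisioning takes time, the useful output is the date by which work has to start, not the date on which capacity runs out.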

On-call

Being on-call means being available outside normal hours to respond to incidents.

“I’m on-call this week, and I’m expected to respond to P1 incidents within 15 minutes.”

Key Vocabulary Reference

Term	Definition
SLI	The metric you measure (e.g., success rate)
SLO	The target for that metric (e.g., 99.9%)
SLA	The customer-facing commitment
Error budget	Allowable downtime or errors within the SLO
Toil	Manual, repetitive, automatable work
Runbook	Step-by-step procedure for a specific scenario
Blameless postmortem	Incident review focused on systemic causes
MTTR	Average time to recover from an incident

SRE vocabulary is precise by design — these terms carry specific meanings that enable exact communication about reliability. Learning to use them accurately will help you contribute to SRE discussions, write clearer incident reports, and work more effectively with reliability-focused teams.
