SLOs, Error Budgets & Reliability Vocabulary

5 exercises — Practice SLI/SLO/SLA and reliability vocabulary in English: error budgets, burn rate, toil, on-call language, and production readiness.

Core SLO & Reliability vocabulary clusters

SLI/SLO/SLA: Service Level Indicator, Service Level Objective, Service Level Agreement, error budget, availability target
Error budget: budget remaining, burn rate, fast burn, slow burn, budget exhaustion, budget policy
On-call: on-call rotation, escalation policy, runbook, playbook, alert fatigue, pager load, handoff
Toil: toil, O(n) growth, automation, toil budget, engineering work vs. toil
Production readiness: PRR (Production Readiness Review), go/no-go, launch readiness, capacity planning, rollout plan

1 / 5

An SRE team lead introduces reliability concepts:
"An SLI is a quantitative measure of reliability, such as request success rate. An SLO is the target we set for that SLI, for example a 99.9% success rate over 30 days. The SLA is the contractual commitment to customers. It is usually less strict than the SLO and carries financial penalties for a breach. The gap between the SLO and the SLA gives us an internal buffer."
What is the relationship between SLI, SLO, and SLA?
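
As a rough illustration of how the three terms relate, the sketch below computes an SLI from request counts and compares it with the SLO and a looser SLA. The request counts and the 99.5% SLA threshold are made-up assumptions; only the 99.9% SLO comes from the quote.

```python
def success_rate(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as fully reliable
    return good_requests / total_requests


SLO_TARGET = 0.999  # internal objective quoted in the exercise
SLA_TARGET = 0.995  # hypothetical contractual threshold, looser than the SLO

# Hypothetical 30-day request counts.
sli = success_rate(good_requests=998_500, total_requests=1_000_000)

meets_slo = sli >= SLO_TARGET
meets_sla = sli >= SLA_TARGET

print(f"SLI={sli:.4%} meets_slo={meets_slo} meets_sla={meets_sla}")
```

With these numbers the service misses its internal SLO but still honours the SLA, which is the buffer the team lead describes.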

2 / 5

An SRE explains error budgets to the product team:
"Our SLO is 99.9% availability. That means we can tolerate 0.1% downtime: that's our error budget. Over 30 days, 0.1% is about 43 minutes. If we've already used 35 minutes, we have about 8 minutes left. If we burn through the entire budget before the month ends, our error budget policy kicks in: we freeze non-critical releases until the budget recovers."
What is an error budget and how does it influence release decisions?
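
The arithmetic in the quote can be checked with a short sketch. The function names are illustrative, and the freeze rule is a simplified reading of the policy described (freeze once nothing is left).

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def should_freeze_releases(budget_minutes: float, used_minutes: float) -> bool:
    """Simplified policy: freeze non-critical releases once the budget is spent."""
    return used_minutes >= budget_minutes


budget = error_budget_minutes(slo=0.999, window_days=30)  # about 43.2 minutes
used = 35.0
remaining = budget - used                                 # about 8.2 minutes

print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
print("freeze releases:", should_freeze_releases(budget, used))
```

The exact budget is 43.2 minutes, so the "43 minutes" and "8 minutes" in the quote are rounded values.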

3 / 5

An SRE explains burn rate alerting:
"We alert on burn rate, not just raw SLI values. A burn rate of 1.0 means we're consuming the error budget at exactly the pace that uses it all up by the end of the 30-day window. A burn rate of 6 means we're consuming it 6× faster, so we'll be out in 5 days. We alert at burn rate 14.4 over a 1-hour window and at burn rate 6 over a 6-hour window. The two alerts catch different failure modes: the short window catches fast burns and the long window catches slower ones."
Why is alerting on burn rate better than alerting on raw error rate?
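
The burn-rate figures can be reproduced with a minimal sketch, assuming a 30-day SLO window and a 99.9% SLO as in the earlier exercises. The function names and the 0.6% observed error rate are illustrative assumptions.

```python
SLO = 0.999
SLO_WINDOW_DAYS = 30
SLO_WINDOW_HOURS = SLO_WINDOW_DAYS * 24  # 720 hours


def burn_rate(observed_error_rate: float, slo: float = SLO) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: int = SLO_WINDOW_DAYS) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    return window_days / rate


def budget_consumed(rate: float, alert_window_hours: float) -> float:
    """Fraction of the whole budget burned during one alert window."""
    return rate * alert_window_hours / SLO_WINDOW_HOURS


rate = burn_rate(observed_error_rate=0.006)  # 0.6% errors against a 0.1% budget
fast = budget_consumed(14.4, alert_window_hours=1)
slow = budget_consumed(6.0, alert_window_hours=6)

print(f"burn rate {rate:.1f}, budget gone in {days_to_exhaustion(6.0):.1f} days")
print(f"14.4x for 1h burns {fast:.0%} of the budget, 6x for 6h burns {slow:.0%}")
```

Under these assumptions, the 1-hour alert fires after roughly 2% of the monthly budget is gone and the 6-hour alert after roughly 5%, which is why the two thresholds pick up fast and slow burns respectively.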

4 / 5

An SRE lead reviews on-call health with the team:
"Our on-call is unsustainable. Pager load is 25 pages per week per person, while a commonly cited target is under 5. Most pages are for things that can't be fixed at 3 a.m. We have a toil problem: 60% of our time goes to responding to the same recurring issues. We need to invest in automation and runbook improvement. The goal is to make on-call boring: rare pages that are always actionable."
What does alert fatigue mean and why is it dangerous for reliability?

5 / 5

An engineering team prepares for a launch:
"Before we launch, we do a PRR — Production Readiness Review. We go through a checklist: SLOs defined, alerting configured, runbooks written, load testing done, rollback plan documented, capacity estimated. The PRR output is a go/no-go decision. If we're not ready on any critical item, we don't launch — or we accept the risk explicitly in writing."
What is a Production Readiness Review (PRR) and what does the go/no-go decision involve?
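
One way to picture the go/no-go logic is as a checklist evaluation. The sketch below is only an illustration: the `ChecklistItem` type, its fields, and the example statuses are hypothetical, while the item names follow the checklist in the quote.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    name: str
    done: bool
    critical: bool = True
    risk_accepted: bool = False  # True only with explicit written sign-off


def go_no_go(items: list[ChecklistItem]) -> tuple[str, list[str]]:
    """Return the decision and the critical items still blocking launch."""
    blockers = [
        item.name
        for item in items
        if item.critical and not item.done and not item.risk_accepted
    ]
    return ("go" if not blockers else "no-go", blockers)


# Hypothetical review state: everything is ready except load testing.
checklist = [
    ChecklistItem("SLOs defined", done=True),
    ChecklistItem("alerting configured", done=True),
    ChecklistItem("runbooks written", done=True),
    ChecklistItem("load testing done", done=False),
    ChecklistItem("rollback plan documented", done=True),
    ChecklistItem("capacity estimated", done=True),
]

decision, blockers = go_no_go(checklist)
print(decision, blockers)

# Accepting the risk in writing unblocks the launch.
checklist[3].risk_accepted = True
decision_after_signoff, _ = go_no_go(checklist)
print(decision_after_signoff)
```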