5 exercises on describing reliability targets, error budgets, and burn rates in professional English.
The reliability vocabulary trio
SLI: the measurement — e.g. request success rate
SLO: the internal target — e.g. 99.9% success rate over 28 days
SLA: the external contract — with financial consequences for breach
Error budget: 100% minus the SLO target — the allowed failure window
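To make the error budget definition concrete, here is a minimal sketch that converts an SLO target into allowed downtime. It assumes a time-based availability SLO in which every minute of downtime counts fully against the budget. The function name error_budget_minutes is hypothetical and used only for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: float) -> float:
    """Allowed downtime in minutes for a time-based availability SLO.

    Assumption: the budget is simply (100% - SLO) of the window length.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    budget_fraction = (100.0 - slo_percent) / 100.0
    window_minutes = window_days * 24 * 60
    return budget_fraction * window_minutes


# Compare the two common window lengths for a 99.9% target.
for days in (28, 30):
    minutes = error_budget_minutes(99.9, days)
    print(f"99.9% over {days} days: {minutes:.1f} minutes of budget")
```

A 99.9% target leaves roughly 40 minutes over 28 days and roughly 43 minutes over 30 days, so the window length should always be stated alongside the target.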
1 / 5
An SRE says: "We're burning through our error budget at 3x the normal rate." What does this mean?
Error budget burn rate — what it means
The error budget is the allowed amount of unreliability within the SLO window (usually 28 or 30 days). A burn rate of 3x means:
The service is consuming its error budget three times faster than the rate that would spend exactly the full budget by the end of the window
If the budget covers 43.8 minutes of downtime per month (99.9% SLO), a sustained 3x burn rate would exhaust it in about 10 days, one third of the window
Why this triggers action: Teams typically set alert thresholds at burn rates of 1x (on pace to spend exactly the budget by the end of the window) and higher. On a 28-day window, for example, a sustained 2x burn exhausts the budget in 14 days and a 6x burn in under 5 days. A 3x burn rate is a serious signal requiring investigation and possibly a deployment hold.
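The burn-rate arithmetic can be sketched as follows. This is an illustration under two assumptions: the burn rate stays constant, and the budget was untouched at the start of the window. The helper names burn_rate and days_to_exhaustion are hypothetical.

```python
def burn_rate(observed_error_ratio: float, slo_percent: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows."""
    allowed_error_ratio = (100.0 - slo_percent) / 100.0
    if allowed_error_ratio <= 0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    return observed_error_ratio / allowed_error_ratio


def days_to_exhaustion(window_days: float, rate: float) -> float:
    """Days until a full budget is spent at a constant burn rate."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_days / rate


# Time to exhaustion for several burn rates on both common windows.
for rate in (1, 2, 3, 6):
    print(
        f"{rate}x: 28-day window lasts {days_to_exhaustion(28, rate):.1f} days, "
        f"30-day window lasts {days_to_exhaustion(30, rate):.1f} days"
    )
```

For example, a service with a 99.9% SLO that is currently failing 0.3% of requests has a burn rate of about 3x, which on a 30-day window exhausts the budget in 10 days.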
Useful phrases:
"We're burning through the error budget at [N]x the expected rate."
"At this burn rate, we'll exhaust the budget in [N] days."
"The burn rate alert fired — we need to investigate the reliability regression."
2 / 5
Which sentence correctly uses SLO vocabulary in a professional context?
What a well-formed SLO statement contains
A professional SLO statement specifies:
Service: "checkout API" — not vague
Target: "99.95%" — specific number
Dimension: "availability" — what is being measured
Window: "rolling 28-day" — the measurement period
Measurement definition: "ratio of successful requests to total" — how success is defined
Why A is wrong: A 100% SLO is unachievable, leaves no error budget, and creates false expectations. Even highly reliable cloud services target figures such as 99.99%, not 100%.
Why C is an SLA, not an SLO: SLAs have customer-facing consequences (refunds, credits). SLOs are internal targets.
Vocabulary note: SLOs are internal. SLAs are external contracts. Confusing them is a common and costly mistake.
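As a rough illustration, the five parts of a well-formed SLO statement listed above can be captured in a small data structure that renders the sentence. The class name SLOStatement, its field names, and the render method are hypothetical, and the check that rejects 100% simply mirrors the reasoning given for option A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLOStatement:
    service: str           # e.g. "checkout API"
    target_percent: float  # e.g. 99.95
    dimension: str         # e.g. "availability"
    window: str            # e.g. "rolling 28-day"
    measurement: str       # how success is defined

    def __post_init__(self) -> None:
        # A 100% target leaves no error budget, so treat it as invalid here.
        if not 0 < self.target_percent < 100:
            raise ValueError("target_percent must be above 0 and below 100")

    def render(self) -> str:
        """Build a one-sentence SLO statement from the five parts."""
        return (
            f"The {self.service} targets {self.target_percent:g}% "
            f"{self.dimension} over a {self.window} window, "
            f"measured as the {self.measurement}."
        )


slo = SLOStatement(
    service="checkout API",
    target_percent=99.95,
    dimension="availability",
    window="rolling 28-day",
    measurement="ratio of successful requests to total requests",
)
print(slo.render())
```

If any of the five fields is hard to fill in, the SLO statement is probably still too vague to be useful.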
3 / 5
A post-mortem reads: "The incident caused 47 minutes of degraded availability, consuming 107% of our monthly error budget." What does this mean?
Error budget exhaustion in a single incident
This sentence means the team's total monthly error budget was smaller than 47 minutes (about 44 minutes, since 47 / 1.07 ≈ 43.9), and this one incident exceeded it entirely:
The error budget is now fully exhausted — and then some
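The figures in the post-mortem sentence can be checked with a short calculation. This sketch assumes a time-based budget in which each minute of degraded availability counts fully. The function names are hypothetical.

```python
def implied_budget_minutes(incident_minutes: float, consumed_percent: float) -> float:
    """Work backwards from 'X minutes consumed Y% of the budget'."""
    if consumed_percent <= 0:
        raise ValueError("consumed_percent must be positive")
    return incident_minutes / (consumed_percent / 100.0)


def budget_consumed_percent(incident_minutes: float, budget_minutes: float) -> float:
    """Share of the budget that one incident used, as a percentage."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return 100.0 * incident_minutes / budget_minutes


budget = implied_budget_minutes(47, 107)
overrun = 47 - budget
print(f"Implied monthly budget: {budget:.1f} minutes")
print(f"Overrun beyond the budget: {overrun:.1f} minutes")
```

The implied budget of roughly 44 minutes is consistent with a 99.9% monthly availability SLO, and the incident overran it by about 3 minutes.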
Consequences: Most SRE-mature teams respond to budget exhaustion by:
Halting non-critical feature deployments
Prioritising reliability work over features
Reviewing why the incident lasted longer than the budget allows
Useful post-mortem phrases:
"This incident consumed [X%] of our [monthly/quarterly] error budget."
"The error budget is now exhausted for this window."
"We need to hold deployments until the budget resets on [date]."
4 / 5
How would you professionally describe this situation: your service has been above its SLO for 3 consecutive months?
What exceeding the SLO means — and why it matters for decision-making
"Above the SLO" means the actual reliability is better than the target. This is good news, but has an important implication in SRE practice:
Healthy error budget = room to ship faster
If you've been above the SLO for three months, your error budget is largely unspent. This signals:
The service may be over-engineered for the current target
The team can afford to take more deployment risk (ship features faster)
Alternatively: the SLO could be tightened to raise the reliability standard
Key phrase: "reliability margin" — the gap between your actual reliability and your SLO target.
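A minimal sketch of how the reliability margin relates to unspent error budget. The measured value of 99.97% against a 99.9% target is a made-up example, not a figure from the exercise, and both function names are hypothetical.

```python
def reliability_margin(actual_percent: float, slo_percent: float) -> float:
    """Gap between measured reliability and the target, in percentage points."""
    return actual_percent - slo_percent


def budget_remaining_fraction(actual_percent: float, slo_percent: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = 100.0 - slo_percent
    if allowed <= 0:
        raise ValueError("a 100% SLO has no error budget")
    used = 100.0 - actual_percent
    return 1.0 - used / allowed


# Hypothetical month: measured 99.97% against a 99.9% target.
margin = reliability_margin(99.97, 99.9)
remaining = budget_remaining_fraction(99.97, 99.9)
print(f"Reliability margin: {margin:.2f} percentage points")
print(f"Error budget remaining: {remaining:.0%}")
```

A small margin in percentage points can still represent a large share of the budget, which is why teams often report headroom as a fraction of the budget rather than as raw availability.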
Contextual vocabulary:
"We've maintained a healthy reliability margin this quarter."
"The error budget headroom allows us to increase deployment frequency."
"Our SLO compliance rate is [X%] — above target."
5 / 5
A colleague asks: "What's our MTTR this month?" You know the team had 3 incidents lasting 12, 8, and 4 minutes. What is the correct answer and how do you phrase it?
MTTR — Mean Time to Recovery (or Resolution)
MTTR = the average time to recover from an incident, calculated over a set of incidents.
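Applied to the three incidents in the question, the calculation can be sketched as below. It assumes that each duration is the full time from the start of the incident to recovery.

```python
from statistics import mean

# Incident durations in minutes, taken from the question.
incident_minutes = [12, 8, 4]

# MTTR is the arithmetic mean: (12 + 8 + 4) / 3.
mttr = mean(incident_minutes)

print(
    f"Our MTTR this month is {mttr:.1f} minutes, "
    f"averaged across {len(incident_minutes)} incidents."
)
```

Stating the number of incidents alongside the average helps the listener judge how much weight the figure deserves, since a mean over three incidents is easily skewed by one long outage.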