How to Write an SLO Document in English
Learn the English structure and phrasing for writing a service level objective document, covering the SLI, target, error budget, and consequences of breach.
An SLO document that just says “99.9% uptime” doesn’t tell anyone how that’s measured, what happens when it’s missed, or who’s accountable for it — this guide covers the structure that makes an SLO document a working reference instead of a number nobody trusts.
Key Vocabulary
SLI (service level indicator) — the specific, measurable metric an SLO is built on, such as the percentage of requests completed within a latency threshold, defined precisely enough that two people would calculate the same number from the same data. “Before we write the target, we need to agree on the SLI itself — is it the percentage of successful HTTP responses, or successful responses within 500ms? Those are different metrics and would produce different numbers from the same traffic.”
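To make the difference between those two candidate SLIs concrete, here is a minimal sketch in Python. The `Request` record, the function names, and the choice to treat any non-5xx status as "successful" are assumptions made for illustration; a real document would state its own definition of success.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical record layout)
    latency_ms: float  # time to respond, in milliseconds


def availability_sli(requests):
    """Percentage of requests with a successful response (assumed: non-5xx)."""
    good = sum(1 for r in requests if r.status < 500)
    return 100.0 * good / len(requests)


def latency_sli(requests, threshold_ms=500):
    """Percentage of requests that succeed AND finish within the threshold."""
    good = sum(
        1 for r in requests
        if r.status < 500 and r.latency_ms <= threshold_ms
    )
    return 100.0 * good / len(requests)


# The same four requests scored under both definitions.
sample = [
    Request(200, 120),
    Request(200, 480),
    Request(200, 900),  # successful, but slower than 500ms
    Request(503, 50),   # fast, but failed
]

print(availability_sli(sample))  # 75.0
print(latency_sli(sample))       # 50.0
```

The same traffic scores 75% under one definition and 50% under the other, which is why the SLI wording has to be settled before anyone argues about the target.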
Target — the specific numeric threshold the SLI needs to meet over a defined time window, such as 99.9% over a rolling 30 days, chosen based on what users actually need rather than what feels aspirational. “We set the target at 99.9%, not 99.99%, because our own data shows users don’t notice the difference in practice, and the tighter target would cost significantly more in on-call burden for no real user benefit.”
Error budget — the amount of allowable failure implied by the target, such as roughly 43 minutes of downtime per month at 99.9%, framed as a budget the team can spend on risk, not just a violation to avoid. “We still have error budget left this month, so shipping this riskier change is a reasonable call — if we’d already burned through it, I’d want to hold off until next month’s budget resets.”
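The "roughly 43 minutes" figure follows directly from the target and the window. A short sketch of the arithmetic, assuming a time-based SLI and a 30-day window (the function name is hypothetical):

```python
def error_budget_minutes(target_percent, window_days=30):
    """Minutes of allowable downtime implied by a target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_percent) / 100


print(round(error_budget_minutes(99.9), 1))   # 43.2
print(round(error_budget_minutes(99.99), 1))  # 4.3
```

Moving from 99.9% to 99.99% shrinks the budget from about 43 minutes to about 4 minutes over the same 30 days, which is the cost the Target example above is weighing.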
Consequence of breach — the specific, agreed action that happens when the error budget is exhausted, such as pausing feature launches in favor of reliability work, written into the document so it isn’t negotiated after the fact. “The document needs a consequence of breach section — right now if we blow through the error budget, there’s no agreed response, which means it’ll just get argued about in the moment instead of being a known process.”
Common Phrases
- “What’s the actual SLI here, and is it something we can measure consistently?”
- “Is this target based on user impact data, or is it an aspirational number?”
- “How much error budget do we have left this period?”
- “What’s the agreed consequence if we breach the error budget?”
- “Does this SLO cover the user-facing behavior that actually matters, or an internal proxy for it?”
Example Sentences
Defining the SLI precisely: “SLI: the percentage of API requests that return within 400ms, measured at the load balancer, excluding requests to internal health check endpoints. We’re measuring at the load balancer specifically because that’s closest to what the end user actually experiences.”
Framing the error budget as a decision tool: “With 60% of this month’s error budget remaining, I’m comfortable approving the migration this week. If we were down to single digits, I’d want to push it to next month rather than risk exhausting the budget on a routine change.”
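The reasoning in that sentence can be sketched as a small calculation. This assumes a request-based SLI, and the function names and the 10% cutoff (standing in for "single digits") are illustrative choices, not a standard.

```python
def budget_remaining_percent(target_percent, total_requests, failed_requests):
    """Share of the period's error budget still unspent, as a percentage."""
    allowed_failures = total_requests * (100 - target_percent) / 100
    return 100.0 * (1 - failed_requests / allowed_failures)


def can_ship_risky_change(remaining_percent, minimum_percent=10.0):
    """Assumed policy: approve risky changes only above the cutoff."""
    return remaining_percent >= minimum_percent


# 1,000,000 requests at a 99.9% target allow about 1,000 failures;
# 400 failures so far leaves about 60% of the budget.
remaining = budget_remaining_percent(99.9, 1_000_000, 400)
print(round(remaining, 1))               # 60.0
print(can_ship_risky_change(remaining))  # True
print(can_ship_risky_change(8.0))        # False
```

Writing the cutoff into the SLO document, whatever value the team picks, keeps the approval decision from depending on who happens to be in the room.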
Writing the consequence of breach: “Consequence of breach: if the error budget is exhausted before the period ends, feature launches for this service are paused, and the team’s next sprint prioritizes reliability work until the SLI is back within target.”
Professional Tips
- Define the SLI with enough precision that two engineers would calculate the same number independently — an ambiguous SLI makes the target essentially meaningless.
- Set the target based on actual user tolerance data where possible, not a round number that sounds impressive — a target set too aggressively creates constant, unnecessary on-call pressure.
- Frame the error budget as a spendable resource for the team, not just a compliance metric. Doing so shifts reliability conversations from “did we fail” to “how much risk can we still afford this period.”
- Write the consequence of breach into the document itself, agreed on in advance — deciding the response to a breach in the middle of a breach tends to produce a worse, more political outcome than deciding it calmly ahead of time.
Practice Exercise
- Write an SLI definition precise enough that a colleague could calculate it from raw data.
- Explain how an error budget changes the way a team makes risk decisions.
- Write a consequence of breach clause for a hypothetical service.