Intermediate
Estimation Language
#sla
#reliability
Reading SLAs & SLOs
6 exercises — interpret uptime percentages, error budgets, percentile latency targets (P99), and SLA breach language for engineers and stakeholders.
1 / 6
An SLA states "99.9% uptime per month". Approximately how much downtime does this permit per month?
43 minutes — 99.9% uptime = 0.1% downtime per month.
Calculation:
• 1 month ≈ 30 days × 24 hours × 60 minutes = 43,200 minutes
• 0.1% of 43,200 = 43.2 minutes
Uptime → downtime reference table (per month):
| SLA | Nines | Monthly downtime | Annual downtime |
|---|---|---|---|
| 99% | Two 9s | ~7.2 hours | ~87.6 hours |
| 99.9% | Three 9s | ~43 min | ~8.8 hours |
| 99.95% | Three-and-a-half 9s | ~22 min | ~4.4 hours |
| 99.99% | Four 9s | ~4.3 min | ~53 min |
| 99.999% | Five 9s | ~26 sec | ~5 min |
Saying it in English:
• "99.9% uptime — that's three nines — permits about 43 minutes of downtime per month"
• "We're on a four-nines SLA, so we get less than 5 minutes of downtime per month"
• "Each additional nine reduces your downtime budget by a factor of ten"
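The conversion from an uptime percentage to a downtime allowance can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function name `allowed_downtime_minutes` is hypothetical, and the 30-day month and 365-day year are simplifying assumptions.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the downtime, in minutes, that an uptime target permits over `days`."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    downtime_fraction = (100 - uptime_percent) / 100
    return downtime_fraction * days * MINUTES_PER_DAY


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99, 99.999):
        monthly = allowed_downtime_minutes(target)            # assumed 30-day month
        yearly = allowed_downtime_minutes(target, days=365)   # assumed non-leap year
        print(f"{target}%: {monthly:.1f} min/month, {yearly / 60:.1f} h/year")
```

Running the loop reproduces the rows of the reference table, and shows that each extra nine divides the allowance by ten.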
2 / 6
What is the difference between an SLA, an SLO, and an SLI?
SLA / SLO / SLI — three layers of reliability measurement.
Definitions:
• SLI (Service Level Indicator) — the actual metric measured: "our API error rate is currently 0.3%"
• SLO (Service Level Objective) — the internal target: "we aim to keep the error rate below 0.5%"
• SLA (Service Level Agreement) — the external contract: "we promise customers 99.9% uptime; if we miss it, we issue credits"
Relationship:
SLI (what we measure) → SLO (what we target) → SLA (what we promise)
The SLO is typically stricter than the SLA (internal target vs. external commitment), which leaves a buffer before the contract is at risk:
• SLA: 99.9% uptime (customer promise)
• SLO: 99.95% uptime (internal target, with buffer)
• SLI: measured every minute from synthetic monitoring
How to use in conversation:
• "Are we within SLO?" — are we meeting our internal target?
• "We've breached the SLA" — we've violated the customer contract; credits may be due
• "The SLI is trending down — we may breach SLO by end of week"
• "What's the SLO for checkout latency?" — asking for the threshold
Error budget:
• Error budget = 100% − SLO = the allowed "bad time"
• "We've consumed 60% of our monthly error budget" — you have 40% left before SLO breach
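The three layers can be read as two thresholds applied to one measurement. The sketch below uses the example targets from this exercise (SLA 99.9%, SLO 99.95%). The function names and the measured values are hypothetical, and a real SLA breach depends on how the contract defines and measures uptime.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget = 100% - SLO, in percentage points."""
    return 100.0 - slo_percent


def classify(sli_percent: float, slo_percent: float, sla_percent: float) -> str:
    """Compare a measured availability (SLI) with the internal and external targets."""
    if sli_percent < sla_percent:
        return "SLA breach"
    if sli_percent < slo_percent:
        return "SLO breach, still within SLA"
    return "within SLO"


if __name__ == "__main__":
    SLA, SLO = 99.9, 99.95  # targets taken from the example above
    for measured in (99.97, 99.92, 99.85):  # made-up SLI readings
        print(f"SLI {measured}% -> {classify(measured, SLO, SLA)}")
    print(f"Error budget for a {SLO}% SLO: {error_budget_percent(SLO):.2f}%")
```

The middle case is the reason for the buffer: the team misses its own target and can react before any customer commitment is affected.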
3 / 6
A service has consumed 80% of its monthly error budget by day 20. What does this mean, and what would you say?
Option C — reads the error budget status correctly and draws an operational conclusion.
Error budget interpretation:
• Error budget = 100% − SLO uptime target
• For 99.9% SLO: error budget = 0.1% of month = ~43 minutes
• Consumed 80% of budget = used ~35 of those ~43 minutes (0.8 × 43.2 ≈ 34.6)
• 80% consumed in 20 days is an average of 4% per day → at that pace the budget runs out around day 25, and the month ends at ~120% of the allowed budget
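The projection for this scenario is a simple linear extrapolation, sketched below. The function names are hypothetical, and the sketch assumes the budget keeps burning at its average pace so far. Real incidents tend to arrive in bursts, so treat the result as a rough signal.

```python
def projected_budget_use(consumed_fraction: float, days_elapsed: int,
                         days_in_period: int = 30) -> float:
    """Extrapolate budget consumption linearly to the end of the period."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return consumed_fraction / days_elapsed * days_in_period


def exhaustion_day(consumed_fraction: float, days_elapsed: int) -> float:
    """Day on which the budget reaches 100% at the average pace so far."""
    if consumed_fraction <= 0:
        raise ValueError("consumed_fraction must be positive")
    return days_elapsed / consumed_fraction


if __name__ == "__main__":
    projected = projected_budget_use(0.80, days_elapsed=20)
    print(f"Projected use at month end: {projected:.0%}")
    print(f"Budget exhausted around day {exhaustion_day(0.80, 20):.0f}")
```

A projection above 100% is the kind of evidence that usually backs a phrase like "we're burning through our error budget".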
Error budget language:
• "We've consumed X% of our error budget" — standard phrase
• "We're burning through our error budget" — informal, warns of fast consumption
• "Our error budget is exhausted" — 100% consumed; now in SLO breach
• "We have [X] minutes of error budget remaining" — quantified
• "Feature freeze until end of month" — a standard response to error budget depletion
• "error budget policy" — a team agreement about what happens when the budget is consumed
Why this matters:
The error budget creates a shared language between engineering and product. When it's high, teams can ship aggressively. When it's low, risky changes are frozen. This makes reliability a business conversation, not only a technical one.
4 / 6
An SLA clause reads: "The latency SLO for the search API is P99 < 500ms." What does P99 mean, and how would you describe it?
P99 — the 99th percentile latency: 99% of requests complete faster than this threshold.
Percentile vocabulary:
• P50 (median) — 50% of requests are faster; the "typical" user experience
• P95 — 95% of requests are faster; the slowest 5% of requests take longer than this
• P99 — 99% of requests are faster; only 1 in 100 are slower
• P99.9 — 99.9% of requests are faster; 1 in 1,000 exceed this
• "tail latency" — informal term for P99 and above
How to describe SLOs in English:
• "Our P99 is 500ms — one in a hundred requests takes 500ms or longer"
• "P50 is 80ms, but the tail latency (P99) is much higher at 1.2 seconds"
• "We need to bring the tail latency down — P99 is 4× our target"
• "The median is fine, but P95 and P99 are far above it, which suggests specific slow paths"
Why P99 matters more than average:
At large scale, even 1% of requests can amount to millions of bad user experiences. Optimising the average can hide severe tail latency, so SLOs are typically set on P95 or P99 for APIs, and on P99.9 for payment and other critical paths. "Tail latency" is the enemy of consistent user experience.
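To see how an average can hide the tail, here is a small sketch using the nearest-rank definition of a percentile. Monitoring systems often use other definitions, such as interpolation or histogram buckets. The `percentile` helper and the sample data are made up for illustration: 97 requests at 80 ms and 3 requests at 1,200 ms.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical sample: 97 fast requests and 3 slow ones.
    latencies_ms = [80.0] * 97 + [1200.0] * 3
    mean = sum(latencies_ms) / len(latencies_ms)
    print(f"mean = {mean:.1f} ms")
    for p in (50, 95, 99):
        print(f"P{p} = {percentile(latencies_ms, p):.0f} ms")
```

In this sample the mean and the median both look healthy, and only P99 exposes the slow requests.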
5 / 6
Your monitoring shows the service is at 99.94% availability over the past 30 days. Your SLO is 99.9%. How do you describe this status?
Option C — confirms SLO compliance, quantifies the margin, and translates the availability percentage into remaining error budget minutes.
Status report structure:
[Current measurement] against [SLO target] → [margin/headroom] Translated to: [X minutes used] of [Y minute budget] → [Z minutes remaining]
Calculation:
• 99.9% SLO = 43 min/month error budget
• Actual: 99.94% → downtime = 0.06% of 43,200 min = ~26 min used
• Remaining: 43 − 26 = ~17 minutes of error budget
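The same translation from percentages to minutes can be sketched as a small helper. The names `slo_status` and `format_status` are hypothetical, and the 30-day window is an assumption that matches the calculation above.

```python
def slo_status(availability_percent: float, slo_percent: float, days: int = 30) -> dict:
    """Translate a measured availability and an SLO into error budget minutes."""
    total_minutes = days * 24 * 60
    budget = (100 - slo_percent) / 100 * total_minutes
    used = (100 - availability_percent) / 100 * total_minutes
    return {
        "budget_min": budget,
        "used_min": used,
        "remaining_min": budget - used,
        "within_slo": availability_percent >= slo_percent,
    }


def format_status(availability_percent: float, slo_percent: float) -> str:
    """Build a one-line status report from the computed figures."""
    s = slo_status(availability_percent, slo_percent)
    verdict = "within SLO" if s["within_slo"] else "in breach of SLO"
    return (f"{availability_percent}% against a {slo_percent}% SLO ({verdict}): "
            f"{s['used_min']:.0f} of {s['budget_min']:.0f} budget minutes used, "
            f"{s['remaining_min']:.0f} remaining")


if __name__ == "__main__":
    print(format_status(99.94, 99.9))
```

A negative `remaining_min` means the budget is exhausted, which corresponds to "we're in breach of SLO".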
Note on Option D: 99.94% is a little over three nines, just short of the 99.95% usually called "three-and-a-half nines". It is not six nines, which would be 99.9999%.
Hedging SLO status language:
• "We're within SLO" — meeting the target
• "We're tracking toward an SLO breach" — moving in the wrong direction
• "We're in breach of SLO" — already failed the target
• "We have X minutes of error budget remaining" — quantified headroom
• "We're comfortably within SLO" — large margin
• "We're on the edge of our SLO" — close to the limit
• "We burned through our error budget in [X days]" — retrospective on a breach
6 / 6
How do you clearly explain an SLA penalty to a non-technical stakeholder?
Option B — explains the breach mechanism, quantifies the penalty in business terms, and confirms the notification obligation was met.
SLA breach communication structure:
• What happened: "uptime dropped below the contractual threshold" — specific but jargon-light
• Consequence: "triggering an automatic service credit" — describes the SLA mechanism
• Quantified impact: "10% of the monthly fee — approximately $2,400" — translates to business terms
• Contractual obligation met: "notified within the required 72-hour window" — shows compliance
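The quantified impact is plain arithmetic, sketched below. The flat 10% credit comes from the example above. The $24,000 monthly fee is inferred from "10% ≈ $2,400" and is an assumption, as is the function name `service_credit`. Real contracts often tier the credit by how far uptime fell.

```python
def service_credit(monthly_fee: float, credit_percent: float) -> float:
    """Return the service credit owed, as a flat percentage of the monthly fee."""
    if monthly_fee < 0 or not 0 <= credit_percent <= 100:
        raise ValueError("fee must be non-negative and percent between 0 and 100")
    return round(monthly_fee * credit_percent / 100, 2)


if __name__ == "__main__":
    fee = 24_000.00  # assumed fee, inferred from the example figures
    credit = service_credit(fee, 10)
    print(f"Service credit: 10% of the monthly fee, approximately ${credit:,.0f}")
```

Stating both the percentage and the dollar amount gives a non-technical stakeholder the contract term and its business impact in one sentence.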
SLA violation vocabulary:
• "SLA breach" / "SLA violation" — the event
• "service credit" — the typical remedy (percentage of monthly fee returned)
• "contractual threshold" — the agreed uptime level
• "downtime allowance" — the permitted downtime period
• "remediation period" — the time allowed to fix a breach before penalties apply
• "cure period" — legal term for the same
• "notification obligation" — contractual requirement to inform the customer within X hours
Distinguishing SLO vs. SLA breach:
• Internal SLO breach: "We missed our internal reliability target — no customer commitments violated, but we need to investigate"
• External SLA breach: "We violated a customer contract — credits are due, and we must generate an incident report"