5 exercises — practice the "nines of availability": 99.9% = 8.76 hours/year downtime, SLA calculations, error budgets, and what achieving "five nines" actually requires.
"Error budget = (1 - SLA) × time period; once exhausted, freeze risky changes."
1 / 5
A service has "99.9% uptime." How much downtime does this allow per year?
Option B is the correct calculation. The "nines" of availability are core vocabulary in SRE and cloud computing: (1) 99% = 2 nines = ~3.65 days/year downtime, (2) 99.9% = 3 nines = ~8.76 hours/year, (3) 99.99% = 4 nines = ~52.6 minutes/year, (4) 99.999% = 5 nines = ~5.26 minutes/year. The formula: downtime = (1 - availability) × (hours in period). For 99.9% annual: (1 - 0.999) × 8,760 = 8.76 hours. Option A (~9 minutes) is far too low; it sits near the five-nines range (~5.26 minutes/year), not three nines. Option C (1 hour) is close to the 99.99% figure (~52.6 minutes/year). Option D describes a per-day rate: 1 minute/day × 365 ≈ 6 hours/year. Memorising the "nines" table is essential for technical SLA conversations.
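The formula above can be checked with a few lines of Python. This is a minimal sketch; the function name downtime_hours and the constant HOURS_PER_YEAR are illustrative choices, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def downtime_hours(availability: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Allowed downtime in hours for an availability fraction such as 0.999."""
    return (1 - availability) * period_hours


# Print the "nines" table for a one-year period.
for pct in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours(pct / 100)
    print(f"{pct}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes/year)")
```

Passing a different period_hours (for example 720 for a 30-day month) gives the monthly figure with the same function.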
2 / 5
A client requires "four nines availability" in an SLA. What does this mean, and what's the maximum downtime per month?
Option B is correct. "Four nines" = four 9s = 99.99%. Monthly calculation, assuming a 30-day month of 720 hours: (1 - 0.9999) × 720 = 0.072 hours = 4.32 minutes. The qualification "demanding SLA" is important, because achieving 99.99% requires (1) redundant infrastructure (no single point of failure), (2) automated failover (a human response time of several minutes would already breach the SLA during an incident), (3) zero-downtime deployments (maintenance windows cannot be used; you'd use rolling deployments), (4) a multi-region or multi-AZ setup to survive infrastructure-level failures. Option A misreads "four nines" as "99 point 4." Option C gets the number roughly right but describes it differently. Option D has no basis.
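To translate "N nines" into a monthly downtime allowance, a small helper is enough. The sketch below assumes a 30-day month (720 hours); the function names are hypothetical.

```python
def unavailability_from_nines(nines: int) -> float:
    """Fraction of time a service may be down: 3 nines -> 0.001, 4 nines -> 0.0001."""
    return 10 ** -nines


def monthly_downtime_minutes(nines: int, hours_in_month: float = 720) -> float:
    """Maximum downtime per month in minutes (30-day month assumed by default)."""
    return unavailability_from_nines(nines) * hours_in_month * 60


for n in (2, 3, 4, 5):
    print(f"{n} nines -> {monthly_downtime_minutes(n):.2f} minutes/month")
```

Computing the unavailable fraction directly as 10 ** -nines avoids the small floating-point error that 1 - 0.9999 would introduce.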
3 / 5
An SRE says "we have an error budget of 43 minutes this month." What is the implied SLA, and how should this budget be managed?
Option B correctly defines error budgets in SRE practice. Key components: (1) calculation: a 30-day month has 43,200 minutes, so (1 - 0.999) × 43,200 = 43.2 minutes; a 43-minute budget therefore implies a 99.9% SLA (cross-check: 8.76 hours/year ÷ 12 months ≈ 44 minutes/month). A 99.99% SLA would allow only 4.32 minutes per month, (2) exhaustion policy: in Google's SRE model, once the error budget is exhausted, risky changes are slowed or frozen to prevent further reliability risk, (3) velocity alignment: this is the key insight, because error budgets make reliability a shared business concern between SRE and development teams, not just an ops problem. Option A confuses error budgets with maintenance windows; they are distinct. Option C treats the budget as a single-use allowance. Option D defines MTTR (mean time to repair), not error budgets.
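The budget arithmetic and a simple exhaustion check might look like the sketch below. It assumes a 30-day month, and the function names and the incident durations in the example are made up for illustration; a real error budget policy would usually carry more nuance than a single boolean.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month (assumption)


def implied_availability(budget_minutes: float,
                         period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Availability target implied by an error budget over a period."""
    return 1 - budget_minutes / period_minutes


def remaining_budget(budget_minutes: float, incident_minutes: list[float]) -> float:
    """Budget left after subtracting the downtime of each incident so far."""
    return budget_minutes - sum(incident_minutes)


def should_freeze_risky_changes(budget_minutes: float,
                                incident_minutes: list[float]) -> bool:
    """True once the budget is exhausted (zero or negative remaining)."""
    return remaining_budget(budget_minutes, incident_minutes) <= 0


print(f"Implied availability: {implied_availability(43):.4%}")
print("Remaining after two incidents:", remaining_budget(43, [12, 20]), "minutes")
print("Freeze?", should_freeze_risky_changes(43, [12, 20, 15]))
```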
4 / 5
A client says "we need five nines." An architect responds "that's extremely difficult to achieve." Which explanation best justifies this?
Option B justifies the difficulty with specific, accurate technical reasons. The key points: (1) 5.26 minutes/year: this is less time than a single typical incident response, meaning automated recovery is mandatory (humans can't reliably respond and resolve within 5 minutes), (2) multi-region active-active: a single datacenter failure would breach five nines, (3) zero-downtime deployments: the annual budget is only about 315 seconds, so weekly deployments with a 30-second restart each would consume it in roughly 10 deployments, well under a quarter of the year, (4) non-linear cost: the jump from 99.99% to 99.999% typically requires a complete architecture redesign, not incremental improvement. Option A is wrong: five nines exists (some financial exchanges target it), but it's very rare and expensive. Option C invents a "5 servers = 5 nines" rule that has no basis. Option D dismisses it as a marketing term.
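The deployment point is easy to verify numerically. The sketch below counts how many fixed-length restarts fit inside an annual downtime budget; the function names are hypothetical, a 365-day year is assumed, and no other source of downtime is counted.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds; leap years ignored


def annual_budget_seconds(availability: float) -> float:
    """Allowed downtime per year in seconds for an availability fraction."""
    return (1 - availability) * SECONDS_PER_YEAR


def deployments_within_budget(availability: float, restart_seconds: float) -> int:
    """Number of whole restarts that fit in the annual budget."""
    return int(annual_budget_seconds(availability) // restart_seconds)


budget = annual_budget_seconds(0.99999)
print(f"Five-nines budget: {budget:.1f} seconds/year")
print("30-second restarts that fit:", deployments_within_budget(0.99999, 30))
```

At one deployment per week, the restarts alone would use up the budget in under three months, which is why zero-downtime deployment techniques are treated as a prerequisite at this tier.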
5 / 5
How do you correctly express "we had 99.95% uptime last month" in terms of actual downtime?
Option B provides the complete, accurate conversion. The calculation, assuming a 30-day month: 720 hours × (1 - 0.9995) = 720 × 0.0005 = 0.36 hours = 21.6 minutes. Contextualising this on the nines scale (between 99.9% and 99.99%) is valuable: it helps stakeholders understand where 99.95% sits relative to common SLA tiers. This is how SRE reports are written: state the SLA percentage, calculate the actual downtime, and contextualise on the nines scale. Option A confuses uptime (availability) with error rate; uptime refers to service availability (was the service up or down), not request success rate, though the two are related. Option C describes the actual outage pattern without the calculation, so it is incomplete. Option D makes a major error: 99.95% uptime means 0.05% downtime (not 5%), which is 21.6 minutes/month, not 5% of the month (~36 hours).
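A report helper could perform both steps, the downtime conversion and the placement on the nines scale. This is a sketch under the same 30-day-month assumption; expressing the position as a fractional number of nines via a base-10 logarithm is one illustrative convention, not something the exercise prescribes.

```python
import math


def downtime_minutes(uptime_percent: float, period_hours: float = 720) -> float:
    """Downtime in minutes implied by an uptime percentage over a period."""
    return (1 - uptime_percent / 100) * period_hours * 60


def nines(uptime_percent: float) -> float:
    """Fractional 'number of nines': 99.9 -> 3.0, 99.99 -> 4.0 (approximately)."""
    return -math.log10(1 - uptime_percent / 100)


uptime = 99.95
print(f"{uptime}% uptime = {downtime_minutes(uptime):.1f} minutes of downtime/month")
print(f"Position on the nines scale: about {nines(uptime):.1f} nines")
```

The same call with 95.0 instead of 99.95 returns 2,160 minutes (36 hours), which makes the 0.05% versus 5% mix-up described for Option D easy to spot.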