Per month: 0.001 × 43,800 min = 43.8 minutes/month
Saying it: "Three nines" (99.9%) is considered standard for many web services. It sounds impressive but allows nearly 9 hours of downtime per year — likely more than users expect.
Common anchor point for discussions: "Three nines is often the baseline SLA for non-critical services. For payment systems or healthcare, we expect at least four nines."
2 / 5
What is meant by "five nines", and how much downtime does it permit per year?
"Five nines" — 99.999% uptime "Five nines" is a shorthand used in SRE and SLA negotiations for the highest practical tier of availability.
The math:
99.999% = 0.001% maximum downtime
0.00001 × 525,600 min/year = 5.26 minutes/year
Per month: ~26 seconds
The "nines" naming convention — count the 9s after the decimal:
Name
Uptime
Downtime/year
Two nines
99%
~87.6 hours
Three nines
99.9%
~8.76 hours
Four nines
99.99%
~52.6 min
Five nines
99.999%
~5.26 min
Realistic note: Five nines requires extreme engineering effort and cost. Even AWS and Azure major services target 99.99%. True five nines is rare and expensive.
3 / 5
An SRE says: "We've used 80% of our error budget this month and it's only the 20th." What is an error budget, and why is this a concern?
Error Budget — the SRE reliability concept Introduced by Google's Site Reliability Engineering (SRE) discipline, an error budget is the maximum allowed unreliability derived directly from the SLA.
Formula: Error budget = 100% − SLA target
SLA = 99.9% uptime → Error budget = 0.1% of time per month ≈ 43.8 minutes/month
SLA = 99.99% → Error budget ≈ 4.38 minutes/month
Why it matters: The error budget serves as a data-driven negotiation tool between product and engineering: • If error budget is healthy (not consumed): Ship new features aggressively. Risk is acceptable. • If error budget is nearly exhausted: Slow down feature deployments, prioritize reliability work. • If error budget is breached: No new feature releases until next period; focus only on stability.
In the question: 80% of the budget consumed on day 20 of 30 means the team is on pace to breach the SLA. This is a meaningful operational alert. The SRE should escalate and potentially hold deploys.
4 / 5
A team is debating their SLA tier for a new payment processing service. The options are 99%, 99.9%, or 99.99% uptime. Which should they target, and why?
Choosing the right SLA tier for critical services Not all services require the same availability. The correct SLA depends on business impact per minute of downtime.
99.9% (three nines): ~8.76 hours/year. Acceptable for: blogs, marketing sites, read-only documentation. Many SaaS services offer this as their base tier.
99.99% (four nines): ~52.6 minutes/year. Expected for: payment processing, banking APIs, e-commerce checkout, critical healthcare systems. Every minute of payment downtime = direct revenue loss + customer trust damage.
100% uptime is not a real, achievable target in engineering — hardware fails, networks partition, deployments require restarts. Engineers should never promise 100% uptime. Even "100% SLA" commitments in contracts typically include exclusions.
Practical vocabulary for SLA discussions: "What's our availability target?" / "What's our SLO?" / "What's our error budget?" / "Can we meet four nines on the checkout path?" / "We need at minimum three nines for the reporting API."
5 / 5
An incident report states: "The service exceeded its SLO, triggering an SLA breach. The team must now review SLIs." What do SLI, SLO, and SLA stand for, and how do they relate?
SLI / SLO / SLA — the three-tier reliability vocabulary These three terms form the core vocabulary of SRE and reliability engineering. Getting them right in interviews and incident reviews signals professional competence.
SLI — Service Level Indicator: A measurement of service behavior. A specific, quantifiable metric. Examples: request success rate (%), API latency at P99 (ms), error rate (%), page load time. "Our SLIs are P99 latency and 5xx error rate."
SLO — Service Level Objective: An internal target for an SLI. What the team promises itself to achieve. Examples: "P99 latency < 200ms", "error rate < 0.1%". SLOs are tighter than the SLA — a buffer exists so you exceed the SLO before breaching the SLA.
SLA — Service Level Agreement: A contract between a service provider and a customer (or business stakeholder) with commercial or legal consequences for breach. "We guarantee 99.99% monthly uptime. Customers receive credits if we fall below this."
Relationship: SLI → measures the thing. SLO → defines the target for the SLI. SLA → formalises the commitment and sets consequences.
The chain: "SLI shows we're at 99.95% — we've breached our 99.99% SLO. If this persists, we'll breach the SLA and owe customer credits."