Error Budget Communication
5 exercises — Practice vocabulary for communicating error budgets: consumption, burn rate, feature freeze, error budget policy, and explaining the concept to product managers.
In a reliability review, an SRE says: "We've consumed 40% of our error budget this month." A PM asks what this means for the product team. Which explanation is correct?
The error budget translates an abstract SLO percentage into a concrete, trackable "spending account" for unreliability: the team knows exactly how much of its allowed unreliability it has "spent" and how much is left for the rest of the month.
The error budget concept (from Google's SRE book) creates a shared language between engineering and product. For a 99.9% SLO, the monthly error budget is 0.1% × 30 days × 24 hours × 60 minutes = 43.2 minutes of allowed downtime. If 40% has been consumed by day 15, the team has used about 17 minutes and has about 26 minutes left for the remaining 15 days. That is slightly below the even pace of 50% at mid-month, so the budget is on track to last. The strength of saying "40% consumed" rather than "we had 17 minutes of downtime" is that the figure is relative to the allowed budget, not just an absolute number. The PM should understand: more than 50% consumed at mid-month means the budget is burning faster than planned and needs closer monitoring; 100% consumed means the budget is exhausted and reliability-first mode is triggered.
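The arithmetic in this explanation can be checked with a short Python sketch. The function names, the fixed 30-day month, and the burn-rate definition used here (fraction of budget consumed divided by fraction of the period elapsed, so 1.0 means the budget would run out exactly at month end) are illustrative assumptions, not part of the exercise.

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for the period, given an availability SLO."""
    period_minutes = period_days * 24 * 60
    return (100.0 - slo_percent) / 100.0 * period_minutes


def budget_status(slo_percent: float, consumed_fraction: float,
                  days_elapsed: int, period_days: int = 30) -> dict:
    """Summarize budget usage; names and structure are hypothetical."""
    budget = error_budget_minutes(slo_percent, period_days)
    used = budget * consumed_fraction
    elapsed_fraction = days_elapsed / period_days
    return {
        "budget_minutes": budget,
        "used_minutes": used,
        "remaining_minutes": budget - used,
        # Assumed definition: 1.0 = spending at exactly the even pace.
        "burn_rate": consumed_fraction / elapsed_fraction,
    }


# The scenario from the explanation: 99.9% SLO, 40% consumed by day 15.
status = budget_status(slo_percent=99.9, consumed_fraction=0.40, days_elapsed=15)
print(f"budget:    {status['budget_minutes']:.1f} min")     # 43.2 min
print(f"used:      {status['used_minutes']:.1f} min")       # 17.3 min
print(f"remaining: {status['remaining_minutes']:.1f} min")  # 25.9 min
print(f"burn rate: {status['burn_rate']:.2f}")              # 0.80
```

A burn rate below 1.0, as here, means the budget would last past the end of the month if the current pace continues.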
Key vocabulary:
• error budget — the allowed amount of unreliability (failures, downtime) derived from the SLO for a given period
• error budget consumption — the percentage of the error budget that has been used so far in the current period
• SLO (Service Level Objective) — a target reliability level (e.g., 99.9% availability) that defines the error budget