Apply multi-window SLOs, error budget policies, burn rate alerting, SLO vs SLA distinctions, and budget-based alert fatigue reduction.
1 / 5
What is a multi-window SLO and why is it more robust than a single time window?
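Multi-window SLO: a single 30-day rolling window smooths over short outages, so a 2-hour outage that consumes a significant chunk of the error budget can go unnoticed. Adding a short window, such as 1 hour, triggers alerts on sharp reliability drops while the long window still tracks overall compliance. The Google SRE Workbook recommends a multi-window, multi-burn-rate alerting model: page when 2% of the budget is consumed in 1 hour (burn rate 14.4) or 5% in 6 hours (burn rate 6), and open a ticket when 10% is consumed in 3 days (burn rate 1).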
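A minimal sketch of a multi-window paging check, assuming burn rates have already been measured per lookback window. The names `BurnRule`, `PAGE_RULES`, and `should_page` are hypothetical. The thresholds follow the Google SRE Workbook paging tiers for a 30-day SLO (burn rate 14.4 over 1 hour, 6 over 6 hours), each paired with a short window one twelfth as long so that the alert stops firing once the burn has ended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    long_minutes: int    # long lookback window
    short_minutes: int   # short window confirming the burn is still ongoing
    threshold: float     # burn rate that must be met in BOTH windows


# Paging tiers for a 30-day SLO (assumed values, per the SRE Workbook):
# 14.4x for 1 hour burns 2% of the budget, 6x for 6 hours burns 5%.
PAGE_RULES = [
    BurnRule(long_minutes=60, short_minutes=5, threshold=14.4),
    BurnRule(long_minutes=360, short_minutes=30, threshold=6.0),
]


def should_page(burn_by_window: dict[int, float]) -> bool:
    """Return True if any rule is exceeded in both its long and short window.

    burn_by_window maps a lookback length in minutes to the measured burn rate.
    """
    return any(
        burn_by_window[rule.long_minutes] >= rule.threshold
        and burn_by_window[rule.short_minutes] >= rule.threshold
        for rule in PAGE_RULES
    )
```

Requiring both windows is the design choice that matters here: the long window shows the burn is significant, and the short window shows it is still happening.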
2 / 5
What is an error budget policy and what actions does it define?
Error budget policy: a pre-agreed set of rules that defines what the organization does as the error budget is consumed. Without a policy, an exhausted error budget is just a number. The policy gives it teeth: engineering leadership agrees upfront that, for example, at 75% budget consumption the on-call team can block feature deployments. This creates a shared incentive: product teams want to ship features, operations wants reliability, and the policy balances the two.
3 / 5
What is burn rate in SLO alerting and how is it calculated?
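Burn rate: how fast the error budget is being consumed relative to the pace that would use it up exactly at the end of the SLO window, calculated as the observed error rate divided by the allowed error rate (1 - SLO target). If your 30-day SLO is 99.9% (0.1% error budget), a burn rate of 14.4 means you would consume the entire monthly budget in about 2 days (30 days / 14.4 ≈ 50 hours), or 2% of it in a single hour. Multi-burn-rate alerts: page immediately at a high burn rate and open a ticket at a low, sustained burn rate, balancing alert fatigue against missing slow degradation.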
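The burn rate arithmetic can be sketched in a few lines. The function names below are hypothetical, and the sketch assumes a 30-day SLO window and an error ratio measured as failed requests divided by total requests.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def hours_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Hours until the whole budget is gone if this burn rate is sustained."""
    return window_days * 24.0 / rate


def budget_consumed(rate: float, elapsed_hours: float, window_days: float = 30.0) -> float:
    """Fraction of the window's budget consumed after elapsed_hours at this rate."""
    return rate * elapsed_hours / (window_days * 24.0)


# A 1.44% error ratio against a 99.9% SLO is a burn rate of about 14.4.
rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
print(round(rate, 1))                       # 14.4
print(round(hours_to_exhaustion(rate), 1))  # 50.0 hours, roughly 2 days
print(round(budget_consumed(rate, 1.0), 3)) # 0.02, i.e. 2% of the budget in 1 hour
```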
4 / 5
What distinguishes an SLO from an SLA in practice?
SLO vs SLA: an SLO is an internal reliability target, while an SLA is a contractual commitment to customers with consequences for missing it. For example, you set your internal SLO at 99.9% and commit in an SLA to 99.5%. The gap is the safety buffer: if your error budget runs low, you can take corrective action before breaching the SLA and triggering financial penalties or customer credits. SLOs are the day-to-day engineering target.
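To make the buffer concrete, the two targets can be converted into allowed downtime. This is a rough sketch that assumes a 30-day window and treats availability as time-based; `allowed_downtime_minutes` is a hypothetical helper.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full downtime the target tolerates over the window."""
    return (1.0 - target) * window_days * 24 * 60


slo_minutes = allowed_downtime_minutes(0.999)  # internal SLO
sla_minutes = allowed_downtime_minutes(0.995)  # contractual SLA

print(round(slo_minutes, 1))                # 43.2
print(round(sla_minutes, 1))                # 216.0
print(round(sla_minutes - slo_minutes, 1))  # 172.8 minutes of buffer
```

Under these assumptions, the team has missed its SLO after about 43 minutes of downtime in a month, but has nearly three more hours before the SLA is breached.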
5 / 5
How does an SLO alerting strategy based on budget burn differ from threshold-based alerting?
Burn rate alerting: a static threshold of "alert if error rate > 0.1%" fires constantly during normal spikes. A burn rate alert fires only when errors are consuming the budget fast enough to exhaust it within a meaningful timeframe. This reduces alert fatigue while ensuring real reliability problems get attention.
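The difference shows up on a toy example. The sketch below compares a static threshold with a 1-hour burn rate check over per-minute error ratios. The function names and sample data are made up for illustration, and averaging the per-minute ratios assumes roughly equal traffic in every minute.

```python
def static_alert(error_ratios: list[float], threshold: float = 0.001) -> bool:
    """Fires if any single sample exceeds the fixed error-rate threshold."""
    return any(ratio > threshold for ratio in error_ratios)


def burn_alert(error_ratios: list[float], slo_target: float = 0.999,
               burn_threshold: float = 14.4) -> bool:
    """Fires if the window's average burn rate reaches the threshold."""
    mean_ratio = sum(error_ratios) / len(error_ratios)
    return mean_ratio / (1.0 - slo_target) >= burn_threshold


# One hour of per-minute samples: a 2-minute blip vs. a sustained problem.
brief_spike = [0.0002] * 58 + [0.01] * 2
sustained = [0.02] * 60

print(static_alert(brief_spike), burn_alert(brief_spike))  # True False
print(static_alert(sustained), burn_alert(sustained))      # True True
```

The static threshold fires on both series, while the burn rate check stays quiet for the blip (average burn rate of about 0.5) and fires only for the sustained failure (burn rate of 20).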