5 exercises on SRE reliability practice — indicators, objectives, and budgets.
1 / 5
What is an SLI in SRE?
A Service Level Indicator (SLI) is a carefully chosen, quantitative measurement of a service's behavior — typically a ratio of good events to total events. Common SLIs include availability (successful requests / total requests), latency (proportion of requests faster than a threshold), and error rate. A good SLI reflects what users actually care about and is measurable from real traffic. SLIs are the raw signal on which objectives and alerting are built; choosing meaningful ones is the foundation of reliability practice.
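As a rough illustration of the "good events over total events" idea, the sketch below computes an availability SLI and a latency SLI from a list of request records. The `Request` type, the choice to count any non-5xx status as a success, and the 300 ms threshold are assumptions made for this example, not definitions from the card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record used only for this sketch."""
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds


def availability_sli(requests):
    """Successful requests / total requests (assumption: non-5xx means success)."""
    if not requests:
        return None  # no traffic, so the ratio is undefined
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms):
    """Proportion of requests faster than threshold_ms."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 80.0),
    Request(503, 950.0),
    Request(200, 310.0),
]
print(availability_sli(sample))    # 0.75
print(latency_sli(sample, 300.0))  # 0.5
```

In practice these counts would come from real traffic (load balancer logs or metrics), and what counts as a "good" event is a per-service decision.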
2 / 5
What is an SLO?
A Service Level Objective (SLO) is a target you set for an SLI over a defined time window — for example, "99.9% of requests succeed over 30 days." It expresses the reliability bar the team commits to internally, balancing user happiness against the cost and pace of development. SLOs are deliberately not 100%, because perfect reliability is impossibly expensive and stifles change. They differ from an SLA, which is a contractual promise to customers with penalties; the SLO is usually stricter than the SLA.
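One way to picture the SLO and SLA relationship is to check the same measurement against both targets. In this minimal sketch the 99.9% SLO comes from the example above, while the 99.5% SLA figure and the `meets_target` helper are hypothetical. `Fraction` is used only to keep the comparison exact.

```python
from fractions import Fraction

SLO_TARGET = Fraction(999, 1000)  # 99.9%, the internal objective from the example
SLA_TARGET = Fraction(995, 1000)  # 99.5%, a hypothetical looser contractual floor


def meets_target(good_events, total_events, target):
    """True if the measured ratio over the window is at or above the target."""
    if total_events == 0:
        return True  # assumption: a window with no traffic does not count as a miss
    return Fraction(good_events, total_events) >= target


# 99.7% of requests succeeded in the 30-day window.
good, total = 997_000, 1_000_000
print(meets_target(good, total, SLO_TARGET))  # False
print(meets_target(good, total, SLA_TARGET))  # True
```

Because the SLO is the stricter of the two, it is missed first, which gives the team a warning before the contractual promise is at risk.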
3 / 5
What is an error budget?
An error budget is the complement of the SLO: if your availability SLO is 99.9%, then 0.1% of requests in the SLO window may fail, and that 0.1% is the budget. It reframes reliability as a resource to spend. While budget remains, teams can ship features and take calculated risks; once it is exhausted, a typical error budget policy freezes risky changes and focuses the team on stability. This aligns developers and operators around a shared, data-driven decision rule instead of arguing subjectively about whether to release.
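The arithmetic can be sketched as follows, assuming a request-based SLO and a window with one million requests. The function names and the traffic figures are illustrative, and the "budget left means risky changes are allowed" rule is a simplified stand-in for a real error budget policy.

```python
from fractions import Fraction


def error_budget(slo_target, total_events):
    """Events allowed to fail in the window without breaching the SLO."""
    return (1 - slo_target) * total_events


def budget_consumed(slo_target, total_events, bad_events):
    """Share of the budget already spent (1 means fully exhausted).

    Assumes slo_target < 1 and total_events > 0, so the budget is non-zero.
    """
    return Fraction(bad_events) / error_budget(slo_target, total_events)


def risky_changes_allowed(slo_target, total_events, bad_events):
    """Simplified policy: keep shipping only while some budget remains."""
    return budget_consumed(slo_target, total_events, bad_events) < 1


slo = Fraction(999, 1000)  # 99.9% availability
print(error_budget(slo, 1_000_000))                  # 1000
print(budget_consumed(slo, 1_000_000, 400))          # 2/5
print(risky_changes_allowed(slo, 1_000_000, 400))    # True
print(risky_changes_allowed(slo, 1_000_000, 1_200))  # False
```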
4 / 5
What is burn rate in SLO-based alerting?
Burn rate measures how quickly you are consuming your error budget relative to a sustainable pace: it is the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the entire budget exactly over the SLO window; a burn rate of 10 would exhaust a 30-day budget in 3 days. Alerting on burn rate, often with multiple windows (a fast one for sharp outages, a slow one for steady degradation), produces alerts that are both timely and low-noise, firing on genuine threats to the SLO rather than on every transient blip.
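A minimal sketch of the calculation, taking burn rate as the observed error rate divided by the error rate the SLO allows. The function names, the two-level "page" / "ticket" scheme, and the thresholds (14.4 for the fast window, 1 for the slow one) are illustrative assumptions. Real alerting rules are tuned per service.

```python
from fractions import Fraction


def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the allowed error rate (assumes slo_target < 1)."""
    if total_events == 0:
        return Fraction(0)  # no traffic, so nothing is being burned
    observed_error_rate = Fraction(bad_events, total_events)
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate


def days_until_exhausted(window_days, rate):
    """Days a full budget would last at a constant burn rate."""
    if rate == 0:
        return None  # budget is never exhausted at zero burn
    return Fraction(window_days) / rate


def alert_level(fast_rate, slow_rate,
                fast_threshold=Fraction(144, 10),  # illustrative: 14.4
                slow_threshold=Fraction(1)):       # illustrative: 1
    """Page on a sharp burn in the fast window, ticket on a steady one in the slow window."""
    if fast_rate >= fast_threshold:
        return "page"
    if slow_rate >= slow_threshold:
        return "ticket"
    return "ok"


slo = Fraction(999, 1000)           # 99.9% over 30 days
rate = burn_rate(100, 10_000, slo)  # 1% errors against 0.1% allowed
print(rate)                            # 10
print(days_until_exhausted(30, rate))  # 3
print(alert_level(fast_rate=Fraction(20), slow_rate=Fraction(2)))  # page
print(alert_level(fast_rate=rate, slow_rate=rate))                 # ticket
```

The first two lines of output reproduce the figures in the card: a burn rate of 10 exhausts a 30-day budget in 3 days.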
5 / 5
What is toil in SRE?
Toil is operational work that is manual, repetitive, automatable, tactical, and devoid of enduring value, and that grows linearly as the service grows: think manually restarting services, applying repetitive config changes, or handling routine tickets. SRE practice caps the time spent on toil (a commonly cited ceiling is 50% of an engineer's time) so engineers retain capacity for engineering that durably reduces future work. Identifying and automating away toil is a core SRE responsibility, because unchecked toil crowds out improvement and burns people out.