Toil: toil, O(n) growth, automation, toil budget, engineering work vs. toil
Production readiness: PRR (Production Readiness Review), go/no-go, launch readiness, capacity planning, rollout plan
1 / 5
An SRE team lead introduces reliability concepts: "An SLI is a quantitative measure of reliability — like request success rate. An SLO is the target we set for that SLI — 99.9% success rate over 30 days. The SLA is the contractual commitment with customers — often less strict than the SLO, with financial penalties for breach. The difference between SLO and SLA gives us internal buffer." What is the relationship between SLI, SLO, and SLA?
SLI (Service Level Indicator): a quantitative measurement of a service characteristic relevant to users. Examples: availability (% of requests returning 2xx), latency (% of requests completing in under 300 ms), throughput (requests per second), error rate (% of requests returning 5xx). SLO (Service Level Objective): the internal target for an SLI. Example: "99.9% of requests succeed over a rolling 30-day window." SLOs drive engineering decisions: what to prioritise, when to release, when to slow down. SLA (Service Level Agreement): a contract between a service provider and customer defining the minimum acceptable performance and the consequences of breach. Typically less strict than the SLO: if the SLO is 99.9%, the SLA might be 99.5%, giving the team 0.4 percentage points of buffer to investigate and fix problems before customers have a claim. SLO vocabulary: Rolling window: a continuously moving time window (e.g., the last 30 days). Calendar window: a fixed calendar period (e.g., a calendar month). Rolling windows track recent user experience and suit day-to-day operations; calendar windows align with business reporting. Availability target: the SLO expressed as a percentage. 99.9% = three nines; 99.99% = four nines; 99.999% = five nines. SLO violation: the SLI falls below the SLO target. In conversation: "Our SLO is 99.95% but our SLA is 99.5%; that gap is our operational margin. If we're close to the SLO, we act; we never let it get near the SLA."
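A minimal sketch of how the three terms relate numerically, assuming availability is measured as the fraction of successful requests. The function names and request counts are hypothetical; the 99.9% and 99.5% targets are the example figures from the answer above.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded in the window."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as available
    return good_requests / total_requests


def classify(sli: float, slo: float = 0.999, sla: float = 0.995) -> str:
    """Compare a measured SLI with the internal SLO and the contractual SLA."""
    if sli >= slo:
        return "within SLO"
    if sli >= sla:
        return "SLO violated, SLA still met"
    return "SLA breached"


# Hypothetical 30-day counts: 1,000,000 requests, 2,000 of them failed.
sli = availability_sli(good_requests=998_000, total_requests=1_000_000)
status = classify(sli)
print(f"SLI = {sli:.4%}: {status}")
```

With these made-up counts the SLI is 99.8%, which sits in the gap between the SLO and the SLA: the team has an internal violation to act on, but no customer claim yet.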
2 / 5
An SRE engineer explains error budgets to the product team: "Our SLO is 99.9% availability. That means we can have 0.1% bad time — that's our error budget. Over 30 days, 0.1% is about 43 minutes. If we've already used 35 minutes of downtime, we have 8 minutes left. If we burn through the entire budget before the month ends, our error budget policy kicks in: we freeze non-critical releases until the budget recovers." What is an error budget and how does it influence release decisions?
Error budget: the allowable downtime or error rate derived from the SLO. Formula: Error budget = 1 - SLO target. A 99.9% SLO gives a 0.1% error budget, and 0.1% of 30 days (43,200 minutes) is 43.2 minutes. The error budget is consumed by: Incidents: outages and degradations. Releases: deployments that introduce failures. Planned maintenance: scheduled downtime. Error budget policy: a set of rules governing what happens at different budget levels. Example: >50% remaining → normal release velocity; 10–50% remaining → caution, extra testing required; under 10% remaining → freeze all non-critical releases; 0% remaining → focus 100% on reliability, no new features. Error budget vocabulary: Budget remaining: how much error budget is left in the current window. Burn rate: how fast the budget is being consumed relative to the rate that would use up exactly the whole budget over the SLO window (burn rate 1). Fast burn: a burn rate high enough to exhaust the budget in hours or days; it should trigger an alert. Slow burn: a burn rate only moderately above 1, sustained long enough to consume a significant share of the budget. Budget exhaustion: the error budget reaches 0. In conversation: "The error budget gives both engineering and product a shared language for reliability vs. velocity trade-offs: when the budget is healthy, ship fast; when it's low, slow down."
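The arithmetic and the example policy can be sketched as follows. This is an illustration only: the function names are hypothetical, and the exact handling of the tier boundaries (for instance, whether exactly 10% remaining counts as "caution" or "freeze") is an assumption, since the example policy does not say.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget = (1 - SLO target) applied to the length of the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def release_policy(remaining_fraction: float) -> str:
    """Map the fraction of budget remaining to the example policy tiers."""
    if remaining_fraction <= 0:
        return "reliability work only, no new features"
    if remaining_fraction < 0.10:  # assumption: exactly 10% falls in "caution"
        return "freeze non-critical releases"
    if remaining_fraction <= 0.50:
        return "caution, extra testing required"
    return "normal release velocity"


budget = error_budget_minutes(0.999)      # about 43.2 minutes
used = 35.0                               # minutes of downtime so far
remaining_fraction = (budget - used) / budget
print(f"budget {budget:.1f} min, remaining {remaining_fraction:.0%}")
print(release_policy(remaining_fraction))
```

In the scenario from the question (35 of roughly 43 minutes used), about 19% of the budget remains, which lands in the "caution" tier of the example policy rather than the freeze tier.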
3 / 5
An SRE explains burn rate alerting: "We alert on burn rate, not just raw SLI values. A burn rate of 1.0 means we're consuming the error budget at exactly the rate it replenishes — we'll exactly exhaust it by end of the month. A burn rate of 6 means we're consuming it 6× faster — we'll be out in 5 days. We alert at burn rate 14.4 for 1-hour windows and burn rate 6 for 6-hour windows — these catch different failure modes." Why is alerting on burn rate better than alerting on raw error rate?
Burn rate alerting: instead of alerting on "error rate > 0.1%", alert on "we're burning the error budget 14.4× faster than the sustainable rate." Multi-window alerting (Google SRE Workbook): Short window, high burn rate (e.g., 1-hour window, burn rate 14.4): detects acute, fast-burning incidents. A burn rate of 14.4 sustained for 1 hour uses 2% of the 30-day budget (14.4 × 1 h / 720 h). Long window, moderate burn rate (e.g., 6-hour window, burn rate 6): detects slower burns that would not trigger the high-rate alert but still accumulate significantly; a burn rate of 6 sustained for 6 hours uses 5% of the budget. Combined, these two alert windows catch 95%+ of impactful SLO failures, while avoiding alerting on brief spikes that don't meaningfully consume the budget. Alert vocabulary: Alert fatigue: too many alerts that are rarely acted on, which desensitises on-call engineers. Symptom: alerts are acknowledged without investigation. Symptom-based alerting: alert on user-visible impact (high error rate), not internal causes (CPU spike). Cause-based alerting: alert on internal metrics; good for dashboards, bad for pages. Page: a high-priority alert that wakes someone up. It should be actionable and urgent. In conversation: "Before burn rate alerts, we missed a 3-day slow degradation that exhausted the budget. The error rate was only 0.05% above normal each hour, but it added up."
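The burn-rate arithmetic can be checked with a short sketch. It assumes a 30-day SLO window and the two thresholds quoted in the scenario (14.4 over 1 hour, 6 over 6 hours); the function names are hypothetical, and a production setup would normally evaluate these conditions in the monitoring system rather than in application code.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window = 720 hours


def burn_rate(error_rate: float, error_budget: float) -> float:
    """Observed error rate divided by the error rate the budget allows."""
    return error_rate / error_budget


def budget_consumed(rate: float, hours: float) -> float:
    """Fraction of the whole window's budget used at `rate` for `hours`."""
    return rate * hours / WINDOW_HOURS


def days_to_exhaustion(rate: float) -> float:
    """Days until a full budget is gone if `rate` is sustained."""
    return (WINDOW_HOURS / rate) / 24


def should_page(rate_1h: float, rate_6h: float) -> bool:
    """Page if either window exceeds its threshold (assumed: 14.4 and 6)."""
    return rate_1h >= 14.4 or rate_6h >= 6.0


print(f"{budget_consumed(14.4, 1):.1%} of budget used by 1 h at 14.4x")
print(f"{budget_consumed(6, 6):.1%} of budget used by 6 h at 6x")
print(f"{days_to_exhaustion(6):.1f} days to exhaustion at 6x")
# Hypothetical outage: 2% of requests failing against a 0.1% budget.
print(should_page(burn_rate(0.02, 0.001), rate_6h=3.0))
```

Both thresholds correspond to a fixed share of the budget (2% and 5% respectively), which is why they scale with the SLO instead of needing a hand-tuned error-rate threshold per service.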
4 / 5
An SRE lead reviews on-call health with the team: "Our on-call is unsustainable. Pager load is 25 pages per week per person — industry best practice is under 5. Most pages are for things that can't be fixed at 3am. We have a toil problem: 60% of our time is responding to the same recurring issues. We need to invest in automation and runbook improvement. The goal is to make on-call boring — rare pages that are always actionable." What does alert fatigue mean and why is it dangerous for reliability?
Alert fatigue: the psychological and operational state in which on-call engineers are overwhelmed by alert volume, so that they acknowledge alerts without thorough investigation ("alert blindness"), miss critical signals among the noise, make poor triage decisions, and experience burnout. Why it is dangerous: a real critical incident gets lost among 50 noise alerts → delayed response → larger customer impact. On-call health vocabulary: Pager load: number of pages (alerts that interrupt a human) per on-call person per week. Google's SRE book recommends at most 2 incidents per 12-hour shift; a common industry rule of thumb is fewer than 5 pages per week. Runbook: a document describing how to respond to a specific alert or situation. Good runbooks include what to check, what to do, and the escalation path. Playbook: similar to a runbook; the terms are sometimes used interchangeably, and sometimes "playbook" covers a broader incident type. On-call rotation: the schedule defining who is on call when. Escalation policy: the sequence of people to contact if the first responder doesn't acknowledge within N minutes. Handoff: transferring on-call responsibility between shifts; includes a summary of open issues. Postmortem backlog: a list of incidents awaiting postmortem writing; it should be short. In conversation: "We deleted 40 alerts last quarter; none of them had triggered a meaningful response in 3 months. Pager load dropped from 20 to 4. The team stopped hating on-call."
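One way to find deletion candidates like the 40 alerts in the quote is to list every alert that has not led to a real response within some cutoff. The sketch below is hypothetical throughout: the alert names, the 90-day cutoff, and the idea of recording a "last actioned" timestamp per alert are assumptions, not features of any particular paging tool.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def stale_alerts(
    last_actioned: Dict[str, Optional[datetime]],
    now: datetime,
    max_age_days: int = 90,
) -> List[str]:
    """Alerts never acted on, or not acted on since the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        name
        for name, when in last_actioned.items()
        if when is None or when < cutoff
    )


def pager_load(pages_in_week: int, people_on_call: int) -> float:
    """Pages per on-call person per week."""
    return pages_in_week / people_on_call


# Hypothetical data: when each alert last led to a meaningful response.
history = {
    "disk_70_percent": datetime(2024, 1, 10),
    "checkout_error_burn": datetime(2024, 5, 28),
    "cpu_spike": None,  # has never led to any action
}
candidates = stale_alerts(history, now=datetime(2024, 6, 1))
print(candidates)
print(pager_load(pages_in_week=8, people_on_call=2))
```

The output is only a review list. Whether an alert is actually deleted, demoted to a dashboard, or rewritten as a symptom-based alert remains a team decision.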
5 / 5
An engineering team prepares for a launch: "Before we launch, we do a PRR — Production Readiness Review. We go through a checklist: SLOs defined, alerting configured, runbooks written, load testing done, rollback plan documented, capacity estimated. The PRR output is a go/no-go decision. If we're not ready on any critical item, we don't launch — or we accept the risk explicitly in writing." What is a Production Readiness Review (PRR) and what does the go/no-go decision involve?
PRR (Production Readiness Review): a structured process SRE teams use to verify a service meets reliability, scalability, and operational standards before launch. PRR checklist categories: Reliability — SLOs defined, error budgets calculated, dependencies identified. Observability — metrics, logs, and traces in place; dashboards built; SLO alerting configured. Operations — runbooks written, on-call rotation assigned, escalation policy defined. Capacity — load testing completed, capacity estimated for expected traffic, autoscaling configured. Deployment — rollback plan documented, canary deployment strategy defined, feature flags in place. Security — threat model reviewed, vulnerabilities addressed, least-privilege access. Go/no-go vocabulary: Go — all critical items are addressed; launch approved. No-go — one or more critical items are unresolved; launch blocked. Launch with risk acceptance — a no-go item is noted but the business explicitly accepts the risk in writing; used for time-sensitive launches. Soft launch — launch to a subset of users; limited blast radius. In conversation: "The PRR found we had no runbook for the database failover scenario — that was a no-go item. We wrote the runbook, re-reviewed, and launched the next week."
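The go/no-go logic can be expressed as a small decision function. This is a sketch under stated assumptions: the `ChecklistItem` type, its field names, and the sample items are hypothetical, and a real PRR is a review conversation, not an automated gate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChecklistItem:
    name: str
    critical: bool
    done: bool
    risk_accepted: bool = False  # explicit written sign-off by the business


def prr_decision(items: List[ChecklistItem]) -> str:
    """Return the launch decision for a completed PRR checklist."""
    blockers = [item for item in items if item.critical and not item.done]
    if not blockers:
        return "go"
    if all(item.risk_accepted for item in blockers):
        return "launch with risk acceptance"
    return "no-go"


# Hypothetical checklist mirroring the example in the answer above.
checklist = [
    ChecklistItem("SLOs defined", critical=True, done=True),
    ChecklistItem("Rollback plan documented", critical=True, done=True),
    ChecklistItem("Database failover runbook", critical=True, done=False),
    ChecklistItem("Dashboard polish", critical=False, done=False),
]
decision = prr_decision(checklist)
print(decision)
```

Here the missing failover runbook blocks the launch, while the unfinished non-critical item does not. Marking the blocker as `risk_accepted=True` would change the outcome to a launch with risk acceptance.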