An SRE lead explains the error budget policy to a development team: "Our SLO is 99.9% availability, giving us about 43 minutes of error budget per month. The policy has four tiers. At 50%+ budget remaining: normal release velocity, no restrictions. At 10-50% remaining: caution mode, where all deployments require SRE sign-off and a rollback plan. At under 10% remaining: budget freeze, with no non-critical releases until the budget recovers. At 0%: reliability sprint, where engineering focuses entirely on reliability work until the budget is restored." What is an error budget policy and why is it important for the relationship between development and SRE teams?
Error budget policy: an agreed-upon contract between development and SRE that specifies what happens at different levels of error budget consumption.
Why it matters: without a policy, reliability versus velocity becomes a subjective, political negotiation after every incident. With one, the rules were agreed while everyone was calm, and they trigger automatically on objective metrics.
Key aspects:
Pre-agreed: dev and SRE agree on the policy before incidents happen, which reduces conflict.
Objective trigger: the thresholds (10%, 50%) are measurable, not a judgment call.
Graduated response: restrictions increase as the budget depletes, rather than switching on or off.
Budget recovery: the policy should specify when restrictions lift.
Error budget math: a 99.9% SLO leaves a 0.1% error budget. Over 30 days: 0.001 × 30 × 24 × 60 = 43.2 minutes. Over a year (365.25 days): about 8.77 hours.
Advanced SRE vocabulary:
Budget exhaustion: the budget reaches 0, which triggers the most restrictive tier.
Budget recovery: the error budget replenishes naturally as the window rolls. With a 30-day rolling window, an incident from 31 days ago is no longer counted.
Budget borrowing: sometimes justified for critical business reasons. It requires executive sign-off and a reliability repayment plan.
SLO target setting: too tight an SLO means the budget is always exhausted; too loose means no meaningful reliability signal.
Aspirational SLO vs. contractual SLO: internal target (tighter) vs. customer commitment (looser, with buffer).
In conversation: 'The error budget policy is the most important organizational agreement in SRE. It transforms reliability from "the SRE team's problem" into a shared engineering concern with objective, pre-agreed consequences.'
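The budget arithmetic, the rolling window, and the tier thresholds from this card can be sketched in Python. This is a minimal illustration, not a production policy engine: the function names, the `(days_ago, minutes)` incident format, and the sample incidents are all assumptions made up for the example. The boundaries follow the card: exactly 50% counts as normal velocity and exactly 10% as caution mode.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def downtime_in_window(incidents, window_days: int = 30) -> float:
    """Sum downtime for incidents inside the rolling window.

    incidents is a list of (days_ago, minutes) pairs (hypothetical format).
    An incident 31 days ago falls outside a 30-day window and is ignored.
    """
    return sum(minutes for days_ago, minutes in incidents
               if 0 <= days_ago < window_days)


def policy_tier(remaining_fraction: float) -> str:
    """Map the remaining budget fraction (0.0 to 1.0) to a policy tier."""
    if remaining_fraction <= 0.0:
        return "reliability sprint"
    if remaining_fraction < 0.10:
        return "budget freeze"
    if remaining_fraction < 0.50:
        return "caution mode"
    return "normal velocity"


# Hypothetical incident history: one old incident has rolled out of the window.
incidents = [(31, 20.0), (12, 18.0), (3, 12.0)]

budget = error_budget_minutes(0.999)
spent = downtime_in_window(incidents)
remaining = (budget - spent) / budget
print(f"{budget:.1f} min budget, {remaining:.0%} remaining -> {policy_tier(remaining)}")
```

With these sample numbers, 30 of the 43.2 budget minutes are spent inside the window, so the service sits in caution mode. Once the 18-minute incident ages past 30 days, the same calculation would return it to normal velocity without anyone negotiating.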
A senior SRE explains the concept of toil at a team planning session: "Toil is not just hard work. The SRE definition is specific: manual, repetitive, automatable, tactical (reactive rather than proactive), scales with service growth (O(n) with the workload), and adds no enduring value. Restarting a service manually because it crashes weekly: toil. Writing the automation to detect and restart it: not toil — that's engineering work. The rule: toil should never exceed 50% of an SRE's time. If it does, the team needs to either automate or escalate." What distinguishes toil from regular engineering work, and why does the 50% toil cap matter?
Toil characteristics (Google SRE definition):
Manual: a human is in the loop where automation could replace them.
Repetitive: done again and again, with the same steps.
Automatable: a machine could do it.
Tactical: interrupt-driven and reactive. It doesn't advance the system's architecture.
Scales with load: if traffic doubles, toil doubles. Toil is O(n) with service growth.
No enduring value: complete it today, and the same problem is back next week.
Toil examples: manually provisioning new service instances, restarting services after crashes, rotating credentials manually, responding to alerts that fire but don't require human judgment.
Non-toil examples: writing runbooks (enduring value), building alerting automation (eliminates future toil), improving the deployment pipeline (engineering work).
50% toil cap: if more than half of SRE time is toil, the team cannot invest in the engineering work that would reduce toil, which creates a toil spiral. Exceeding 50% also signals that the team is operating as ops, not SRE.
Corrective actions: automate the toil source, hand the service back to the development team if the toil is caused by poor code quality, or escalate to leadership with data if the team is under-staffed.
Toil vocabulary:
Toil budget: an explicit allocation of acceptable toil time.
Toil reduction sprint: focused work to automate or eliminate a specific toil source.
Toil taxonomy: classifying toil by source (code bugs, deployment issues, capacity, etc.).
Automation gap: the opportunity to eliminate toil through automation.
In conversation: 'When an SRE complains about toil, ask: is this automatable? If yes, write a ticket. If they've been saying the same thing for 6 months, something is wrong with how we prioritise.'
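One way to make the 50% cap measurable is to compute the toil share from a time log. The sketch below assumes a hypothetical weekly log keyed by work category, plus a hand-maintained set of categories the team has classified as toil; neither comes from the card.

```python
TOIL_CAP = 0.50  # the cap named in the card


def toil_fraction(hours_by_category: dict, toil_categories: set) -> float:
    """Return the share of logged hours spent on categories marked as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(hours for category, hours in hours_by_category.items()
               if category in toil_categories)
    return toil / total


# Hypothetical week for one SRE, in hours.
week = {
    "manual restarts": 9.0,
    "credential rotation": 6.0,
    "noisy alerts": 8.0,
    "automation work": 12.0,
    "design review": 5.0,
}
toil_categories = {"manual restarts", "credential rotation", "noisy alerts"}

fraction = toil_fraction(week, toil_categories)
status = "over the cap" if fraction > TOIL_CAP else "within the cap"
print(f"toil share: {fraction:.1%} ({status})")
```

Here 23 of 40 logged hours are toil, so the week is over the cap. Tracked over several weeks, the same number gives the team data to bring to leadership when escalating.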
An SRE lead facilitates a blameless postmortem after a database outage: "The word 'blameless' is precise: we focus on systems and processes, not individual decisions. 'The engineer made the wrong call' is blame — it ends investigation and demoralises people. 'The system had no guardrail that would have flagged this call as risky' is blameless — it identifies an improvable system. We use 5 Whys: Why did the DB go down? Disk full. Why? Logs not rotated. Why? Log rotation misconfigured. Why? New service deployed without the checklist. Why? Checklist not enforced by automation. That's the systemic cause." What is the difference between a proximate cause and a systemic cause in a postmortem?
Proximate cause: the direct, immediate cause, i.e. what triggered the incident. Example: the disk filled up, causing the database to stop accepting writes.
Systemic cause: the underlying condition that allowed the proximate cause to happen and to have the observed impact. Example: the deployment checklist was not enforced by automation (systemic), so a service shipped with misconfigured log rotation, logs grew unchecked, and the disk filled (proximate).
5 Whys technique: repeatedly ask "why?" to move from the proximate cause to the systemic one. Going 3-5 levels deep typically reaches systemic causes.
Blameless postmortem vocabulary:
Blameless: the system failed, not the person. The question is not "who made the mistake?" but "what system conditions allowed this mistake to have this impact?"
Contributing factor: a condition that contributed to the incident without being the sole cause. There are often several.
Latent defect: a bug or misconfiguration that existed before the incident but didn't trigger a failure until a second condition appeared.
Aggravating factor: a condition that worsened the impact but didn't cause it.
Action items: specific, measurable, owned, time-bound improvements. Good: "Add disk usage alert at 80% for all database hosts (Owner: Alice, Due: 2026-06-15)." Bad: "Monitor disk better."
What went well: documenting successful responses in the postmortem, which reinforces good practices.
In conversation: 'A postmortem that ends at the proximate cause without systemic analysis will see the same incident again in 6 months with a different trigger. The systemic cause is the only thing worth fixing.'
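The "good" action item above (a disk usage alert at 80%) is concrete enough to sketch. The example below uses Python's standard `shutil.disk_usage`; the path checked and the printed message are placeholders, and a real setup would more likely express this rule in the team's monitoring system than in a script.

```python
import shutil

DISK_ALERT_THRESHOLD = 0.80  # threshold taken from the example action item


def disk_usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(used_fraction: float,
                 threshold: float = DISK_ALERT_THRESHOLD) -> bool:
    """True when usage has reached or passed the alert threshold."""
    return used_fraction >= threshold


# Placeholder path: on a database host this would be the data volume.
path = "."
used = disk_usage_fraction(path)
if should_alert(used):
    print(f"ALERT: disk at {used:.0%} on {path!r}")
else:
    print(f"ok: disk at {used:.0%} on {path!r}")
```

Keeping the threshold check separate from the measurement makes the rule easy to test without filling a disk.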
An SRE explains Production Readiness Reviews to a team preparing to launch: "A PRR is a structured assessment before a new service goes to production, or before an existing service takes significantly more traffic. We review: SLOs defined? Dashboards and alerts set up? Runbooks written? Load tested? Rollback procedure documented? On-call rotation established? Failure modes analysed? If items aren't met, we don't simply block the launch and walk away; we work with the team to either address them or document the accepted risk." What is the purpose of a Production Readiness Review (PRR) and what distinguishes it from a security review or code review?
Production Readiness Review (PRR): a structured engagement where SRE evaluates a service's operational readiness.
Focus areas:
Observability: SLIs defined? Dashboards? Alerts covering the four golden signals (latency, traffic, errors, saturation)? Structured logging? Distributed tracing?
Reliability: SLOs defined and agreed? Error budget policy? Failure mode analysis (what happens when dependencies fail)? Circuit breakers?
Scalability: load tested? Bottlenecks identified? Auto-scaling configured?
Operational readiness: runbooks written? Does the on-call rotation include the service team? Escalation path defined? Deployment automation? Rollback procedure documented and tested?
Security (shared with the security review): least privilege, secrets management.
PRR vs. other reviews:
Code review: correctness, style, architecture, test coverage.
Security review: authentication, authorization, input validation, cryptography, data handling.
PRR: can we operate this service reliably in production?
PRR outcomes vocabulary:
Launch blocker: a critical PRR finding that must be resolved before launch.
P1/P2 action item: a required post-launch action with a defined timeline.
Accepted risk: a known gap, documented and accepted by the service team, with a mitigation plan.
SRE engagement model: how SRE works with a development team (embedded, consulting, product SRE).
Hand-back criteria: conditions under which SRE returns an unstable service to the development team for reliability investment.
In conversation: 'The PRR isn't only a gate; it's a partnership. We sit with the team and work through the checklist together. Half the value is the questions the checklist raises, not the sign-off at the end.'
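The outcome vocabulary above maps naturally onto a small data model. The sketch below is an illustration under assumptions: the `ChecklistItem` structure, the `critical` flag, and the sample items are invented for the example and do not describe any particular PRR tool.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    MET = "met"
    UNMET = "unmet"
    ACCEPTED_RISK = "accepted risk"  # documented gap with a mitigation plan


@dataclass
class ChecklistItem:
    name: str
    critical: bool
    status: Status


def launch_blockers(items):
    """Critical items that are unmet and not covered by an accepted risk."""
    return [i.name for i in items if i.critical and i.status is Status.UNMET]


def post_launch_actions(items):
    """Non-critical unmet items, tracked as follow-up action items."""
    return [i.name for i in items if not i.critical and i.status is Status.UNMET]


# Hypothetical review state for a service about to launch.
review = [
    ChecklistItem("SLOs defined and agreed", critical=True, status=Status.MET),
    ChecklistItem("Rollback procedure tested", critical=True, status=Status.UNMET),
    ChecklistItem("Distributed tracing", critical=False, status=Status.UNMET),
    ChecklistItem("Auto-scaling configured", critical=True, status=Status.ACCEPTED_RISK),
]

print("launch blockers:", launch_blockers(review))
print("post-launch actions:", post_launch_actions(review))
```

Treating an accepted risk as a distinct status, rather than as "met", keeps the gap visible in later reviews.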
An SRE manager explains the embedded vs. centralised SRE model: "In the embedded model, SRE engineers sit inside product teams — one or two SREs per team. They're deeply aligned with the product, move fast, and have strong context. The trade-off: they can be captured by the team and pulled into development work. In the centralised model, SREs own a platform or specific reliability domain; product teams consume services and support. Faster standardization, but slower feedback loops. Most large orgs use hybrid: a central SRE platform team plus embedded SRE consultants who interface with product teams." What is SRE capture and why is it a risk in the embedded model?
SRE capture: the risk that an embedded SRE becomes a de facto developer because the team has so much feature work that reliability work is deprioritised. The SRE loses their reliability perspective and advocacy role.
Mechanisms: the embedded SRE gets pulled into sprint planning as a developer, into code reviews, and into feature work. Their manager is the product team manager, who values feature velocity. SRE leadership has no visibility into the SRE's work.
Prevention:
SRE reporting chain: embedded SREs should have a dotted line to SRE leadership for performance reviews and career development.
Toil budget enforcement: SRE leadership monitors the 50% toil cap.
PRR participation: embedded SREs must participate in PRRs, which maintains reliability accountability.
Rotation: some orgs rotate SREs between teams to prevent capture and cross-pollinate practices.
SRE organisational vocabulary:
Embedded SRE: an SRE sitting within a product team. Context-rich, product-aligned.
Platform SRE: owns shared reliability infrastructure (observability platform, deployment system). Product teams consume its services.
CRE (Customer Reliability Engineering): SREs who work with external customers (large enterprises) on their reliability practices. Google's external SRE consultancy model.
SRE engagement model: the defined relationship between SRE and a product team. Engagement types: on-call support, consulting, embedded.
Hand-back: when SRE returns operational responsibility to the development team, typically because a service has exceeded its error budget too consistently, indicating that reliability investment is needed from the dev team.
In conversation: 'SRE capture is insidious: it happens gradually over months. The signal is when the SRE's sprint tasks look identical to the developers' sprint tasks. At that point, you've lost your reliability engineer.'