5 exercises — practice structuring strong English answers for SRE EM interviews: team structure, error budget, incidents, toil, and on-call programme management.
How to structure SRE EM interview answers
Team structure questions: embedded vs. centralised model → team API → SRE engagement model (error budget gates, SRE readiness review)
Error budget questions: define error budget → burn rate → policy enforcement → escalation path when burn rate is high
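The first two steps of the error budget answer (define the budget, then the burn rate) come down to simple arithmetic. The following is a minimal sketch, not a reference implementation: the function names, the 30-day budget period, and the sample numbers are all assumptions made for illustration.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the budgeted error rate.

    A burn rate of 1.0 consumes exactly the whole budget over the budget
    period; anything above 1.0 exhausts it early.
    """
    if total_events == 0:
        return 0.0
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget(slo_target)


def hours_to_exhaustion(rate: float, period_hours: float = 30 * 24) -> float:
    """Hours until a full budget is spent if the current burn rate persists.

    The 30-day default period is an assumption for this sketch.
    """
    if rate <= 0:
        return float("inf")
    return period_hours / rate


# Hypothetical sample: 50 failed requests out of 10,000 against a 99.9% SLO.
sample_rate = burn_rate(bad_events=50, total_events=10_000, slo_target=0.999)
print(f"burn rate: {sample_rate:.1f}x")
print(f"budget gone in about {hours_to_exhaustion(sample_rate):.0f} hours")
```

In an interview answer, the useful point is the interpretation: a burn rate well above 1 means the budget will run out long before the period ends, which is what triggers the policy and escalation steps.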
The interviewer asks: "How would you structure an SRE team at a 200-person engineering organisation?" Which answer is most considered?
Option B is strongest: it names three models with a risk/fit analysis for each, identifies the centralised model's specific failure mode (outsourcing operational thinking), explains the embedded model's graduation mechanism (readiness criteria), recommends model (3) with a rationale specific to 200 engineers, and correctly challenges the naive headcount ratio with the right metric (on-call sustainability and decreasing SRE operational load).
SRE organisation vocabulary:
Embedded SRE: an SRE engineer temporarily embedded in a product team to transfer reliability practices.
Reliability threshold: the set of criteria (SLOs, runbooks, alert quality) a team must meet to operate independently.
Readiness review: a formal SRE assessment of whether a service is ready for production.
Platform SRE: an SRE team that owns shared reliability infrastructure rather than individual services.
Engagement programme: the structured process by which SRE works with product teams.
Options C and D are accurate but lack the centralised model risk analysis and the ratio challenge.
2 / 5
The interviewer asks: "How do you enforce error budget policies when development teams resist?" Which answer is most effective?
Option B is strongest: it reframes resistance as a policy design problem before an enforcement problem (the key insight), explains why mechanistic rules work better than judgement-based rules, names the alignment prerequisite (the policy is co-developed with PMs, not handed down), provides the specific escalation framing (a business argument with dollar figures, not an SRE opinion), and closes with the counter-intuitive insight that SRE consultancy earns the trust that makes enforcement unnecessary.
SRE policy vocabulary:
Error budget policy: a predefined rule specifying what happens when an error budget is depleted or burning too fast.
Burn rate: the rate at which the error budget is being consumed relative to the budget period.
Reliability sprint: a sprint dedicated to reliability improvement, triggered by error budget policy.
Mechanistic policy: a rule triggered by measurable data, not human judgement.
Policy co-development: designing the policy jointly with affected stakeholders so they own it.
Options C and D are accurate but lack the design-before-enforcement framing and the PM advocacy insight.
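A mechanistic policy, in the sense used above, is one that could be written as a pure function of measured data. The sketch below illustrates that idea only: the thresholds, the action names, and the `ErrorBudgetPolicy` type are hypothetical, and a real policy would use whatever values the SRE and product teams agree on together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetPolicy:
    # Hypothetical thresholds, chosen only for illustration.
    fast_burn_threshold: float = 10.0
    slow_burn_threshold: float = 2.0


def policy_action(
    budget_remaining: float,
    burn_rate: float,
    policy: ErrorBudgetPolicy = ErrorBudgetPolicy(),
) -> str:
    """Map measured data to a pre-agreed action, with no judgement call.

    budget_remaining is the unspent fraction of the error budget (0.0 to 1.0).
    """
    if budget_remaining <= 0.0:
        return "reliability_sprint"  # budget depleted: feature work pauses
    if burn_rate >= policy.fast_burn_threshold:
        return "escalate"  # burning far too fast: follow the escalation path
    if burn_rate >= policy.slow_burn_threshold:
        return "review"  # elevated burn: raise it at the next planning review
    return "ship_normally"


print(policy_action(budget_remaining=0.4, burn_rate=12.0))
print(policy_action(budget_remaining=0.8, burn_rate=0.5))
```

Because the same inputs always produce the same action, a disagreement becomes a conversation about the thresholds rather than about any individual decision.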
3 / 5
The interviewer asks: "What does a healthy on-call programme look like?" Which answer is most comprehensive?
Option B is strongest: it provides five properties with specific measurable thresholds (5 alerts, 4 engineers minimum, 6-8 target), explains the psychological mechanism behind rotation size (feeling permanently on-call with fewer than 4), introduces compensation as a design element, explains what a good runbook enables (an unfamiliar engineer can reach mitigation without tribal knowledge), and names the absence signal for psychological safety (engineers quietly struggling instead of escalating, which is a management insight).
On-call programme vocabulary:
Alert actionability rate: the proportion of alerts that required a human response and meaningful action.
Alert fatigue: desensitisation to alerts caused by high volume or low actionability.
Rotation size: the number of engineers in the on-call rotation.
Gameday: a planned exercise where teams test runbooks and failure modes in a controlled environment.
Tribal knowledge: undocumented context required to respond to an incident effectively.
Options C and D are accurate but lack the absence signal explanation and the psychological permanence mechanism.
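Two of the measurable properties above, alert actionability rate and rotation size, can be checked with a few lines of code. This is a rough sketch under stated assumptions: the function names and status labels are invented for illustration, and only the rotation figures (a minimum of 4, a target of 6 to 8) come from the explanation above.

```python
MIN_ROTATION_SIZE = 4            # minimum rotation size named in the answer
TARGET_ROTATION = range(6, 9)    # target of 6 to 8 engineers


def actionability_rate(actionable_alerts: int, total_alerts: int) -> float:
    """Proportion of alerts that required a human response and meaningful action."""
    if total_alerts == 0:
        return 1.0  # assumption: no alerts means nothing was noise
    return actionable_alerts / total_alerts


def rotation_status(size: int) -> str:
    """Classify a rotation against the minimum and target sizes."""
    if size < MIN_ROTATION_SIZE:
        return "below_minimum"  # engineers tend to feel permanently on-call
    if size in TARGET_ROTATION:
        return "in_target_range"
    return "outside_target_range"


# Hypothetical week: 24 alerts fired, 18 needed real action.
print(f"actionability: {actionability_rate(18, 24):.0%}")
print(rotation_status(3), rotation_status(7))
```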
4 / 5
The interviewer asks: "How do you measure the health of your SRE team?" Which answer is most multi-dimensional?
Option B is strongest: it organises metrics into three named dimensions (outcomes, sustainability, impact) with reasoning for each, explains why toil above 50% signals the team has become an ops team (a definitional insight from the SRE book), introduces the on-call experience survey as a tool, names the repeat incident rate as a postmortem programme health metric, and identifies the anti-metric (loose SLO setting for appearance) and the correct leading indicator.
SRE health metrics vocabulary:
Toil: manual, repetitive, automatable operational work that scales with service load.
Toil budget: the maximum acceptable fraction of SRE time spent on toil (typically 50%).
MTTD (Mean Time to Detect): average time from incident start to alert.
MTTR (Mean Time to Resolve): average time from detection to resolution.
Repeat incident rate: the proportion of incidents caused by the same root cause category; a high rate signals postmortem action items are not being completed.
On-call experience score: a qualitative survey measuring the on-call responder's perception of the shift quality.
Options C and D are accurate but lack the ops-team-vs-SRE-team distinction for toil and the anti-metric explanation.
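The metric definitions above translate directly into small calculations. The sketch below is illustrative only: the `Incident` record, the sample data, and the reading of repeat incident rate as "incidents whose root cause category has already occurred" are assumptions, not a standard definition.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:  # hypothetical record shape
    started: datetime
    detected: datetime
    resolved: datetime
    root_cause: str


def _mean(deltas) -> timedelta:
    deltas = list(deltas)  # assumes at least one incident
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents) -> timedelta:
    """Average time from incident start to alert."""
    return _mean(i.detected - i.started for i in incidents)


def mttr(incidents) -> timedelta:
    """Average time from detection to resolution."""
    return _mean(i.resolved - i.detected for i in incidents)


def repeat_incident_rate(incidents) -> float:
    """Share of incidents whose root cause category was already seen."""
    counts = Counter(i.root_cause for i in incidents)
    return sum(n - 1 for n in counts.values()) / len(incidents)


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Fraction of working time spent on toil; compare with the 50% budget."""
    return toil_hours / total_hours


t0 = datetime(2024, 1, 1, 12, 0)
m = lambda minutes: timedelta(minutes=minutes)
day = timedelta(days=1)
incidents = [
    Incident(t0, t0 + m(4), t0 + m(34), "config"),
    Incident(t0 + day, t0 + day + m(6), t0 + day + m(66), "capacity"),
    Incident(t0 + 2 * day, t0 + 2 * day + m(5), t0 + 2 * day + m(50), "config"),
]
print(mttd(incidents), mttr(incidents))
print(f"repeat rate: {repeat_incident_rate(incidents):.0%}")
print("over toil budget:", toil_fraction(22, 40) > 0.5)
```

Tracking these per quarter rather than as single snapshots makes the trend visible, which matters more in an interview answer than any one number.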
5 / 5
The interviewer asks: "What's the difference between an SRE team and an ops team?" Which answer is most precise?
Option B is strongest: it provides three named dimensions with a structural test for each, introduces the removal test (if SRE disappeared tomorrow, would reliability survive?) as a sharp definitional tool, uses the toil metric as the cultural distinguisher, introduces the temporal investment framing (bored SRE engineers are a healthy signal), and names the specific signals of a rebranded ops team (postmortem actions never close, toil is never eliminated).
SRE vs. ops vocabulary:
Reliability as a product: the SRE team produces reliability systems and practices, not just incident response.
Toil: the metric that distinguishes SRE from ops; high toil means ops work.
Self-sustaining reliability: product teams can maintain reliability without continuous SRE involvement.
Removal test: would the system remain reliable if the SRE team were removed?
Rebranded ops team: an ops team renamed SRE without the cultural and process transformation.
Options C and D are accurate but lack the removal test and the "bored SRE" sustainability insight.