Interview English for SRE Engineers: SLIs, SLOs, Error Budgets, and Incident Management

Ace your SRE interview in English — learn how to discuss SLIs, SLOs, error budgets, and incident management using the precise vocabulary and confident language patterns recruiters expect.

Site Reliability Engineering interviews are some of the most technically demanding in the industry — and for non-native English speakers, they add a communication layer on top of the technical challenge. You need to use precise SRE vocabulary correctly, structure your answers clearly, and demonstrate a philosophy of reliability engineering, not just technical knowledge.

This guide focuses on the specific English patterns, vocabulary, and answer structures for the questions SRE interviews consistently ask.

Understanding What SRE Interviewers Are Looking For

Before the language, understand the mindset. Interviewers for SRE roles are evaluating:

  1. Reliability thinking — do you think in terms of reliability goals, error budgets, and trade-offs?
  2. Incident philosophy — do you approach incidents with a blameless, systematic mindset?
  3. Communication under pressure — can you explain complex reliability concepts clearly?
  4. Operational maturity — have you built and operated systems at scale?

Your language should reflect all four. Vague answers, imprecise vocabulary, or technical knowledge without philosophical grounding will not pass.

Core SRE Vocabulary You Must Use Correctly

SLI — Service Level Indicator

What it is: A specific, measurable metric that indicates the level of service being provided. SLIs are the inputs to your reliability targets.

Examples of SLIs:

  • Request success rate: “The percentage of HTTP requests that return a 2xx response”
  • Latency: “The proportion of requests served in under 300ms”
  • Availability: “The fraction of time the service is reachable”
  • Durability: “The percentage of records that are still retrievable after being written”

Common interview mistake: Naming a broad metric such as “latency” or “availability” as the SLI without saying exactly what is measured, where it is measured, and what counts as a good event. Be specific.

Usage in an interview: “For our payment service, the primary SLI was the success rate of payment processing requests — specifically, the ratio of successful payment completions to total attempts, excluding known client errors.”
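To make the "successful completions to total attempts, excluding known client errors" wording concrete, here is a minimal sketch of how such a ratio could be computed. The `RequestOutcome` record and the choice to treat every 4xx response as a known client error are assumptions made for illustration, not a description of any particular system.

```python
from dataclasses import dataclass


@dataclass
class RequestOutcome:
    """One request, reduced to the HTTP status it returned (hypothetical record)."""
    status_code: int


def success_rate_sli(outcomes: list[RequestOutcome]) -> float:
    """Ratio of successful requests to valid requests.

    Assumption: 4xx responses are known client errors and are excluded
    from the denominator; 2xx responses count as success.
    """
    valid = [o for o in outcomes if not 400 <= o.status_code < 500]
    if not valid:
        return 1.0  # no valid traffic in the window (a convention, not a rule)
    good = sum(1 for o in valid if 200 <= o.status_code < 300)
    return good / len(valid)


# 997 successes, 3 server errors, 50 client errors that are excluded
outcomes = (
    [RequestOutcome(200)] * 997
    + [RequestOutcome(500)] * 3
    + [RequestOutcome(400)] * 50
)
print(f"{success_rate_sli(outcomes):.1%}")  # 99.7%
```

Being able to say which events are in the denominator is exactly the kind of precision the interview answer above demonstrates.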

SLO — Service Level Objective

What it is: The target value for an SLI. The SLO is the internal commitment your team makes about reliability.

Format: [SLI metric] is [target] over [time window]

Examples:

  • “99.9% of requests return a successful response over a rolling 28-day window”
  • “95% of requests are served in under 200ms, measured over a rolling 7-day window”
  • “Availability is at least 99.95% per calendar month”

The SLO is not the SLA. The SLO is internal. The SLA (Service Level Agreement) is the external, contractual commitment — typically lower than the SLO to provide a buffer.

Usage: “Our SLO was 99.9% availability over a rolling 28-day window. We deliberately set it stricter than our SLA of 99.5% to give us an internal buffer to detect and correct issues before they become contractual breaches.”
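The "[SLI metric] is [target] over [time window]" format maps naturally onto a small data structure. The sketch below is illustrative only: the `SLO` class and its method names are hypothetical, and the 99.9% and 99.5% figures are the ones used in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    sli_name: str     # what is measured
    target: float     # e.g. 0.999 for 99.9%
    window_days: int  # length of the rolling window

    def describe(self) -> str:
        """Render the SLO in the '[SLI] is [target] over [window]' format."""
        return (
            f"{self.sli_name} is at least {self.target:.2%} "
            f"over a rolling {self.window_days}-day window"
        )

    def is_met(self, measured_sli: float) -> bool:
        return measured_sli >= self.target

    def is_stricter_than(self, sla_target: float) -> bool:
        """True when the internal SLO leaves a buffer above the external SLA."""
        return self.target > sla_target


availability_slo = SLO("Availability", 0.999, 28)
print(availability_slo.describe())
print(availability_slo.is_met(0.9993))           # True
print(availability_slo.is_stricter_than(0.995))  # True: the SLO sits above the SLA
```

The last check captures the relationship between the two terms: the internal SLO is the stricter number, and the contractual SLA is the looser one.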

Error Budget

What it is: The amount of downtime or failure that is permitted within an SLO period. It is the complement of the reliability target: whatever the SLO does not require, the budget allows.

Formula: Error budget = 1 - SLO target

Example calculation:

  • SLO: 99.9% availability over 30 days
  • Available minutes in 30 days: 43,200
  • Allowed downtime: 0.1% × 43,200 = 43.2 minutes per 30-day window
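The same arithmetic can be written as a short function, which is handy for checking the numbers you plan to quote in an interview. The function name is hypothetical; the calculation is the formula given above applied to a time-based availability SLO.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime in minutes: (1 - SLO target) x minutes in the window."""
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes


print(round(error_budget_minutes(0.999, 30), 1))   # 43.2
print(round(error_budget_minutes(0.9995, 30), 1))  # 21.6
```

Note how quickly the budget shrinks: moving from 99.9% to 99.95% halves the allowed downtime.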

How it is used: The error budget is shared between reliability work and feature development. When the error budget is healthy, teams can ship faster. When the budget is exhausted or nearly so, the team should prioritise reliability over new features.

Usage in an interview: “We used the error budget as a policy tool. If more than 50% of the error budget remained with two weeks left in the month, engineering teams could proceed with risky deployments. If the remaining budget dropped below 25%, we implemented a freeze on non-critical releases and focused on reliability work.”
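A policy like the one in that answer can be expressed as a simple threshold check. This is a sketch under stated assumptions: the 50% and 25% thresholds come from the example answer, the behaviour of the middle band is an assumption (the answer does not specify it), and the "two weeks remaining" condition is left out for brevity.

```python
from enum import Enum


class ReleasePolicy(Enum):
    OPEN = "risky deployments may proceed"
    CAUTION = "routine releases only"  # assumed middle band, not from the answer
    FREEZE = "non-critical releases frozen"


def release_policy(budget_remaining: float) -> ReleasePolicy:
    """Map the fraction of error budget remaining (0.0 to 1.0) to a policy."""
    if budget_remaining > 0.50:
        return ReleasePolicy.OPEN
    if budget_remaining < 0.25:
        return ReleasePolicy.FREEZE
    return ReleasePolicy.CAUTION


for remaining in (0.60, 0.30, 0.10):
    print(f"{remaining:.0%} remaining: {release_policy(remaining).value}")
```

In an interview, the point to make is that the thresholds are agreed in advance, so the decision to freeze is a policy outcome rather than a negotiation.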

Structuring Your SRE Interview Answers

The SRE Answer Framework

For most SRE questions, use this structure:

  1. The context — briefly describe the system or situation
  2. The metric/objective — what you were measuring or trying to achieve
  3. The trade-off or decision — what choice was made and why
  4. The outcome — what happened, what you learned

Key Question: “How do you set an SLO?”

Strong answer structure:

“Setting an SLO starts with understanding what users actually care about — not what’s easy to measure, but what constitutes good service from their perspective. For a read-heavy API, users care about availability and latency. For a payment service, they care about correctness and durability above all else.

I prefer to start by looking at historical performance data. If the service has been running at 99.95% for the past 12 months, setting the SLO at 99.5% is too low — it does not create meaningful pressure to improve. I’d set it at 99.9%, giving us a small buffer for controlled degradation while still holding the team accountable to high standards.

We also negotiate the SLO with the consuming teams — their reliability requirements constrain our budget. A service that the checkout flow depends on cannot have an SLO lower than what checkout needs to meet its own SLO.”
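The dependency rule in the last paragraph of that answer can be checked mechanically. The sketch below applies the rule exactly as stated (a dependency's SLO must not be lower than the consumer's); the function and service names are hypothetical, and it does not model how several dependencies compound.

```python
def slo_conflicts(consumer_target: float, dependency_targets: dict[str, float]) -> list[str]:
    """Return the dependencies whose SLO target is lower than the consumer's."""
    return sorted(
        name
        for name, target in dependency_targets.items()
        if target < consumer_target
    )


# Hypothetical targets: checkout needs 99.9%
dependencies = {"payments-api": 0.9995, "inventory-api": 0.995}
print(slo_conflicts(0.999, dependencies))  # ['inventory-api']
```

A result like this is a good prompt for the negotiation the answer describes: either the dependency raises its target or the consumer adjusts its own.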

Key Question: “Walk me through how you handle a P1 incident.”

Strong answer using structured language:

“When a P1 fires, my first priority is impact assessment, not diagnosis. I want to know: how many users are affected, what is the failure mode, and is the impact growing? That informs whether I need to escalate immediately or whether I can work through the runbook with the current team.

Simultaneously, I open the incident channel and post the first status update within five minutes — even if I have nothing to report yet. Saying ‘we are aware and investigating’ prevents duplicate escalations and gives stakeholders a place to follow along.

My next action depends on whether there is a clear mitigation — a rollback, a circuit breaker, a DNS change. If yes, I run the mitigation first and diagnose second. If no obvious mitigation, I focus on isolating the failure domain: is this affecting all regions or just one? All user cohorts or a specific segment? That narrows the search space.

Throughout the incident, I update the channel every 15 to 30 minutes with the current impact, what we’ve tried, what we’re trying next, and the revised ETA. I designate a scribe early so I can focus on diagnosis rather than documentation.

After resolution, I schedule the post-mortem within 48 hours. Blameless — we document what happened, what we missed, and what specific changes to systems, processes, or monitoring will prevent recurrence.”
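The four-part status update mentioned in that answer (current impact, what has been tried, what is next, revised ETA) is easy to turn into a template. The sketch below is one possible shape, with hypothetical field names and example text, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class StatusUpdate:
    current_impact: str
    tried: list[str] = field(default_factory=list)
    next_step: str = "Investigating"
    eta: str = "unknown"

    def render(self) -> str:
        """Format the update as four labelled lines for the incident channel."""
        tried = "; ".join(self.tried) if self.tried else "nothing yet"
        return (
            f"Impact: {self.current_impact}\n"
            f"Tried: {tried}\n"
            f"Next: {self.next_step}\n"
            f"ETA: {self.eta}"
        )


# First update: we are aware and investigating, with nothing else to report yet
first = StatusUpdate("Checkout errors for some users")
print(first.render())

# Later update with more detail
later = StatusUpdate(
    "Checkout errors in one region",
    tried=["rolled back latest release"],
    next_step="Failing over to a healthy region",
    eta="30 minutes",
)
print(later.render())
```

Using the same four labels in every update also gives you a ready-made spoken structure when you describe incident communication in an interview.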

Key Question: “What is an error budget, and how have you used one?”

“An error budget is the operationalisation of an SLO into something actionable. If we have a 99.9% availability SLO over 30 days, we have 43.2 minutes of allowed downtime. That budget belongs jointly to engineering and product — it is not the SRE team’s budget alone.

In practice, I’ve used error budgets in two ways. First, as a deployment gate: we had a policy that a rollout requiring more than 10% of the remaining error budget for the month needed explicit sign-off from the engineering lead. This forced a conversation about risk that would not have happened otherwise.

Second, as a priority signal. When we burned through our error budget in the first week of the month after a bad release, we stopped all feature work and spent two weeks on reliability improvements — better health checks, improved circuit breakers, and a database connection pooling fix that had been in the backlog for months. The next month, we burned less than 20% of the budget.”
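The deployment gate from the first half of that answer can be sketched as a single comparison. The 10% threshold comes from the example answer; the function name, the idea of estimating a rollout's cost in minutes, and the handling of an exhausted budget are assumptions made for illustration.

```python
def needs_signoff(
    estimated_cost_minutes: float,
    remaining_budget_minutes: float,
    threshold: float = 0.10,
) -> bool:
    """True when a rollout is expected to consume more than `threshold`
    of the error budget that remains for the period."""
    if remaining_budget_minutes <= 0:
        return True  # assumption: with no budget left, every rollout needs sign-off
    return estimated_cost_minutes / remaining_budget_minutes > threshold


# 30 minutes of budget left this month
print(needs_signoff(5.0, 30.0))  # True: 5 of 30 minutes is about 17%
print(needs_signoff(2.0, 30.0))  # False: 2 of 30 minutes is about 7%
```

The value of a gate like this is less the calculation than the conversation it forces about how risky a rollout really is.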

Key Question: “Describe a blameless post-mortem you ran.”

“After a 47-minute outage on our primary API, I facilitated the post-mortem with about 12 people — engineers, the on-call, the incident commander, and the product manager.

I opened by reading the blameless culture statement we had at the top of every post-mortem template: ‘We assume everyone made the best decisions they could with the information available to them at the time.’ That statement does real work: it signals that the purpose of the meeting is learning, not blame.

We went through the timeline chronologically. At each decision point, I asked: ‘What did you know at that moment? What options did you consider?’ — not ‘why did you do that?’ That framing keeps the conversation analytical rather than defensive.

The most valuable part was identifying the systemic contributors. The engineer who deployed the change was not the cause — the real factors were: no canary deployment process for configuration changes, a monitoring gap that meant we detected the issue via customer reports rather than alerts, and a runbook that was 18 months out of date. Those three systemic issues became action items with owners and deadlines.

Two months later, all three were fixed. The next similar incident was caught automatically in under 3 minutes.”

Language Patterns for SRE Interviews

Demonstrating Reliability Thinking

  • “The first question I ask is: what is the impact on users?”
  • “The SLO is not a technical target — it is a business commitment.”
  • “I treat reliability work and feature work as competing for the same error budget.”
  • “The goal is not 100% uptime — it is the right level of reliability for the cost.”

Discussing Trade-Offs

  • “The trade-off here is between reliability and velocity…”
  • “We could achieve higher availability by [X], but the cost would be [Y]…”
  • “I would make the case to the team that the reliability investment has a higher ROI than the feature in the current quarter, given our burn rate.”

Showing Blameless Culture Fluency

  • “The first thing we do in a post-mortem is establish the facts — not assign responsibility.”
  • “We look for systemic causes, not individual mistakes.”
  • “The question I ask is: what made it easy for this error to have impact?”

Admitting Uncertainty Professionally

  • “I haven’t worked with that specific tool, but the underlying concept is [X], and I’ve implemented similar patterns using [Y].”
  • “That’s at the edge of my experience — here’s how I would approach learning it…”
  • “I want to give you an accurate answer — can I think through that out loud?”

Key Takeaways

  • SLI is the metric. SLO is the target. Error budget is the allowable failure derived from the SLO. Use these precisely — interviewers notice when terms are used interchangeably.
  • Structure incident answers with: impact first → communication → mitigation → diagnosis → post-mortem.
  • Demonstrate blameless culture vocabulary: systemic contributors, learning opportunities, what made the error easy to make.
  • Show reliability trade-off thinking: error budgets are shared between reliability and velocity; neither can have unlimited priority.
  • When discussing SLOs, explain why you chose that target, not just what it is.
  • For uncertainty: be honest, bridge to related knowledge, show a learning approach.

SRE interviews test both technical depth and a specific philosophy. The engineers who pass these interviews are not those with the most technical knowledge; they are those who have internalised the SRE way of thinking about reliability. Your English should reflect that thinking in every answer.