Customer Reliability Engineering English: SLAs, SLOs, and Customer-Facing Reliability Vocabulary

Master the English vocabulary SRE teams use to talk about reliability commitments, incident communication, and customer-facing status updates.

When an outage hits at 2 a.m. and customers are asking questions, your English needs to be fast, clear, and precise. Customer-facing reliability work demands a specific vocabulary — one that bridges technical reality and business communication. This post covers the core terms SRE and platform teams use when talking to customers, writing status pages, and running postmortems.

Core Reliability Agreement Terms

SLA (Service Level Agreement) — a formal, often legally binding contract between a service provider and a customer that specifies the expected level of service, including uptime guarantees and penalties for missing them.

“Our enterprise SLA guarantees 99.9% monthly uptime. If we breach it, customers receive service credits automatically.”

SLO (Service Level Objective) — an internal target for a specific reliability metric, usually stricter than the SLA. Teams use SLOs to catch problems before they become SLA violations.

“We set our SLO at 99.95% so we have a buffer before we hit the 99.9% SLA threshold.”
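To see what these percentages mean in minutes, here is a minimal sketch that converts an uptime target into allowed downtime. It assumes a 30-day month and time-based availability; the function name `allowed_downtime_minutes` is hypothetical, chosen for illustration.

```python
from fractions import Fraction

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(target_percent: str, days: int = 30) -> float:
    """Minutes of downtime permitted in the window at the given uptime target.

    The target is passed as a string (for example "99.9") so that Fraction
    can parse it exactly and avoid floating-point rounding noise.
    Assumption: a 30-day window unless another length is given.
    """
    unavailable_share = 1 - Fraction(target_percent) / 100
    return float(unavailable_share * days * MINUTES_PER_DAY)


sla_minutes = allowed_downtime_minutes("99.9")
slo_minutes = allowed_downtime_minutes("99.95")

print(f"SLA 99.9%  allows {sla_minutes:.1f} minutes of downtime per 30 days")
print(f"SLO 99.95% allows {slo_minutes:.1f} minutes of downtime per 30 days")
print(f"Buffer before an SLA breach: {sla_minutes - slo_minutes:.1f} minutes")
```

In conversation, this lets you say "our SLA gives us about 43 minutes a month" instead of quoting a percentage that customers find hard to picture.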

SLI (Service Level Indicator) — the actual measured metric used to evaluate whether an SLO is being met. Common SLIs include request latency, error rate, and availability.

“Our primary SLI for the checkout service is the percentage of requests completing in under 500ms.”
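As a rough illustration of how an SLI like this one might be calculated, the sketch below computes the share of requests faster than a threshold. The sample latencies are invented for the example, and `latency_sli` is a hypothetical helper, not part of any monitoring tool.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float = 500.0) -> float:
    """Percentage of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests in the measurement window")
    fast_requests = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return 100.0 * fast_requests / len(latencies_ms)


# Invented sample: ten request latencies in milliseconds.
sample = [120, 480, 510, 95, 700, 300, 499, 250, 1000, 410]
print(f"Latency SLI: {latency_sli(sample):.1f}% of requests under 500 ms")
```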

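Error budget — the amount of unreliability a service is allowed to have within a given period while still meeting its SLO. With a 99.95% SLO, for example, the error budget is the remaining 0.05% of requests (or of time) in that period. When the error budget is exhausted, teams typically freeze new feature releases.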

“We’ve burned through 80% of our monthly error budget in the first two weeks — no new deploys until the 1st.”
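The arithmetic behind a sentence like this can be sketched as follows. The example assumes a request-based SLO and an expected request count for the whole month; both numbers are invented, and `error_budget_consumed` is a hypothetical name.

```python
def error_budget_consumed(
    slo_percent: float, expected_requests: int, failed_requests: int
) -> float:
    """Share of the error budget already used (1.0 means fully exhausted).

    expected_requests is the assumed request volume for the whole SLO window,
    so the budget is the number of failures that window can tolerate.
    """
    allowed_failures = expected_requests * (1 - slo_percent / 100)
    if allowed_failures <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return failed_requests / allowed_failures


# Invented numbers: 10 million requests expected this month, 4,000 failed so far.
used = error_budget_consumed(99.95, 10_000_000, 4_000)
print(f"Error budget consumed: {used:.0%}")
```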

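MTTR (Mean Time to Recovery) — the average time it takes to restore a service after an incident begins. The R is also expanded as repair, resolve, or respond, so confirm which definition your team uses. A lower MTTR generally points to faster detection and recovery processes.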

“After we automated our rollback process, our MTTR dropped from 45 minutes to under 8 minutes.”

MTTA (Mean Time to Acknowledge) — the average time between an alert firing and a team member confirming they are investigating it.

“PagerDuty shows our MTTA spiked last week — we need to review our on-call rotation.”
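Both averages can be derived from three timestamps per incident. The sketch below is one way to do that, under the assumption that the alert time marks the start of the incident; the `Incident` record and the sample times are hypothetical and do not reflect any vendor's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    alert_fired: datetime   # assumed here to mark the start of the incident
    acknowledged: datetime  # a team member confirms they are investigating
    recovered: datetime     # service is restored


def _mean(deltas: list[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("no incidents to average")
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time from alert to acknowledgement."""
    return _mean([i.acknowledged - i.alert_fired for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from alert to recovery."""
    return _mean([i.recovered - i.alert_fired for i in incidents])


# Invented sample incidents.
incidents = [
    Incident(datetime(2024, 3, 4, 3, 47), datetime(2024, 3, 4, 3, 51),
             datetime(2024, 3, 4, 4, 32)),
    Incident(datetime(2024, 3, 9, 14, 0), datetime(2024, 3, 9, 14, 2),
             datetime(2024, 3, 9, 14, 8)),
]
print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
```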

Incident Communication Terms

Incident communication — the practice of keeping stakeholders and customers informed during an active outage, using clear, non-technical language and a regular update cadence.

“During the incident, we sent customer-facing updates every 30 minutes even when we had nothing new to report — silence makes customers anxious.”

Customer-facing status page — a public webpage (often hosted on a separate domain so it stays reachable during an outage) that shows real-time service health and incident updates. Atlassian Statuspage (formerly Statuspage.io) is a common tool.

“Update the status page before you post in Slack — customers check it before they email support.”

Business impact statement — a short, plain-English description of how an incident affects customers and their workflows, used in external communications and executive briefings.

“Avoid technical jargon in the business impact statement. Instead of ‘pod eviction storm,’ write ‘some users were unable to log in for approximately 12 minutes.’”

RCA (Root Cause Analysis) — a structured investigation into why an incident occurred, focused on identifying underlying system or process failures rather than assigning individual blame.

“The RCA revealed that the incident was triggered by a config change that bypassed our staging environment review.”

Postmortem — a written document produced after an incident, combining timeline, RCA, business impact, and action items. Blameless postmortems are widely treated as best practice on SRE teams.

“The postmortem is due Friday — make sure the action items have owners and deadlines, not just vague descriptions.”

Real IT Context Phrases

These phrases appear regularly in customer reliability work. Study them as complete units:

  • “We are currently investigating an issue affecting…” — the standard opening for a status page incident update
  • “This incident has been resolved. We apologize for any inconvenience.” — the closing statement after recovery
  • “No customer data was affected during this incident.” — a critical reassurance phrase in security-adjacent incidents
  • “We will publish a full postmortem within 5 business days.” — a commitment phrase used in enterprise incident closures
  • “The root cause was a misconfigured load balancer rule introduced during the maintenance window.” — example of a precise, non-blaming RCA sentence structure

Key Collocations

Learn these as fixed phrases — they are used almost exactly this way in professional SRE writing:

Collocation | Usage
burn the error budget | “We burned the error budget with that botched migration.”
breach an SLA | “Three incidents in one week — we’re at risk of breaching the SLA.”
declare an incident | “The on-call engineer declared an incident at 03:47 UTC.”
publish a postmortem | “We always publish postmortems publicly to build customer trust.”
restore service | “Service was fully restored by 14:22 UTC.”
customer-facing impact | “Quantify the customer-facing impact before the exec briefing.”
miss the SLO | “We missed the SLO in March due to the CDN provider outage.”

Writing Clear Customer Updates

One of the hardest English skills in SRE work is translating technical findings into customer language. Follow this pattern:

  1. What is happening (in plain terms, no jargon)
  2. Who is affected (all users / users in region X / users using feature Y)
  3. What you are doing (investigating / implementing a fix / monitoring)
  4. When you will update next (specific time, not “soon”)

Bad: “Experiencing elevated p99 latency due to pod scheduling backpressure on node pool eu-west-2c.”

Good: “Some users may experience slow loading times. Our team is actively investigating and will provide an update by 15:00 UTC.”
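If your team drafts updates from a template, the four-part pattern above could be sketched as a small helper like the one below. This is only an illustration: `customer_update` and its parameters are hypothetical, and the time check simply enforces the "specific time, not soon" rule.

```python
import re

# Accepts 24-hour times such as "15:00"; rejects vague values like "soon".
_TIME_UTC = re.compile(r"^([01]\d|2[0-3]):[0-5]\d$")


def customer_update(who: str, what: str, doing: str, next_update_utc: str) -> str:
    """Build a plain-language status update from the four-part pattern."""
    if not _TIME_UTC.match(next_update_utc):
        raise ValueError("give a specific next-update time such as '15:00', not 'soon'")
    return (
        f"{who} {what}. Our team is {doing} and will provide "
        f"an update by {next_update_utc} UTC."
    )


print(customer_update(
    who="Some users",
    what="may experience slow loading times",
    doing="actively investigating",
    next_update_utc="15:00",
))
```

A template cannot remove jargon for you, so the wording of each part still needs a human review before it goes on the status page.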

Practice

Take the last incident or near-miss at your company and write a 3-paragraph postmortem executive summary in English. Include: a business impact statement (one sentence), the root cause (one sentence), and three action items with owners. Aim for zero technical acronyms in the first paragraph — the audience is a non-technical VP. Share it with a colleague for feedback on clarity.