Complete English Guide for DevOps & SRE Engineers

Incident bridges, postmortem facilitation, SLO negotiations, pipeline communication, on-call handoffs — the precise, high-stakes English of keeping systems running.

8 sections · 25+ internal practice links · Intermediate – Advanced

Why English Matters for DevOps & SRE Engineers

DevOps and Site Reliability Engineering are disciplines where communication quality directly affects system reliability. An ambiguous incident update can lead to duplicate mitigation attempts that make an outage worse. A poorly written postmortem that implies blame rather than identifying systemic causes can poison team culture. A runbook with vague steps causes the on-call engineer to lose precious minutes during a production incident at 3am.

The language of DevOps and SRE is highly specialised. It includes the terse, precise language of incident Slack bridges, the structured narrative of postmortem documents, the formal vocabulary of SLA and SLO contracts, the procedural language of runbooks, and the persuasive language used when arguing for reliability investments in business terms. Each of these registers requires different vocabulary and different norms.

Additionally, DevOps and SRE roles are among the most globally distributed in tech. On-call rotations span continents. Incident bridges bring together engineers from multiple countries. Infrastructure-as-code repositories are reviewed by distributed teams. Platform decisions are documented in RFCs read by engineering organisations across multiple time zones. In all of these contexts, English is the operational language.

This guide covers the specific English you need in these high-stakes, time-pressured communication contexts. The goal is not to sound like a native speaker — it is to communicate clearly, precisely, and professionally so that your colleagues understand exactly what is happening and exactly what they need to do.

Section 1: Incident Response Language

Incident communication is one of the most demanding English contexts in the entire field of tech. During an active incident, messages need to be short, precise, and unambiguous. There is no time for politeness conventions that add words without adding meaning. At the same time, you need to maintain a factual, non-blaming tone, coordinate multiple people's efforts, and keep stakeholders informed — all simultaneously.

Incident Bridge Communication

An incident bridge (the Slack channel, Zoom call, or war room where an incident is managed) has specific communication norms. State facts, not vague impressions: "The database CPU is at 97%; we are investigating root cause." rather than "I think maybe the database is having issues." If you do share a hypothesis, label it explicitly, and report any change in its status immediately: "Hypothesis: the deployment at 14:32 introduced the regression." / "Confirmed: rolling back resolves the issue."

Incident roles use specific English titles: the Incident Commander (IC) coordinates the response and has final authority on decisions; the Communications Lead writes external status updates; the Technical Lead investigates and implements fixes; the Scribe documents the timeline. When taking a role, announce it: "I'm taking IC." When you have a finding: "Finding: error rate on the payment service spiked to 4% at 14:35 — coincides with the 14:32 deploy." When escalating: "Escalating to the database team — I need a DBA on this bridge now."

Status updates during an incident follow a template: current status (degraded / down / investigating / mitigated / resolved), impact (which users/services are affected, what is the observable symptom), current action (what is being done right now), and ETA if known. "Update: services are partially degraded. Customers attempting checkout are seeing a 502 error approximately 15% of the time. We have rolled back the 14:32 deployment and are monitoring. ETA to full resolution: ~10 minutes."
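The four-part template above can be sketched as a small data structure, which makes it harder to forget a field under pressure. This is only an illustration: the class name, field names, and output wording are assumptions, not a standard tool.

```python
from dataclasses import dataclass
from typing import Optional

# Status labels taken from the template above; treating them as a fixed set is an assumption.
VALID_STATUSES = {"degraded", "down", "investigating", "mitigated", "resolved"}


@dataclass
class StatusUpdate:
    status: str                # one of VALID_STATUSES
    impact: str                # who is affected and the observable symptom
    current_action: str        # what is being done right now
    eta: Optional[str] = None  # leave as None when no estimate exists

    def render(self) -> str:
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        parts = [
            f"Update: status is {self.status}.",
            f"Impact: {self.impact}",
            f"Current action: {self.current_action}",
        ]
        # Only state an ETA when one is known; never guess.
        if self.eta is not None:
            parts.append(f"ETA: {self.eta}")
        return " ".join(parts)


update = StatusUpdate(
    status="degraded",
    impact="About 15% of checkout attempts return a 502 error.",
    current_action="We have rolled back the 14:32 deployment and are monitoring.",
    eta="~10 minutes to full resolution.",
)
print(update.render())
```

Leaving the ETA out entirely when it is unknown mirrors the "ETA if known" rule: a missing estimate is better than an invented one.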

Severity Levels and Escalation Language

Incidents are classified by severity (often SEV1–SEV4 or P1–P4). The language varies with severity: SEV1/P1 language is terse and urgent; SEV4/P4 language is more measured. Common patterns: "Declaring SEV1: all hands on deck." / "Downgrading to SEV2: we have confirmed there is no customer impact." / "Escalating to SEV1 based on customer impact exceeding 5%." / "Paging the on-call for [team name]: need eyes on the database metrics."

Section 2: Postmortem & Blameless Culture Writing

Postmortem documents (also called post-incident reviews or incident retrospectives) are one of the most important written artifacts in SRE culture. A well-written postmortem prevents future incidents by making the systemic causes visible and actionable. A poorly written postmortem that assigns blame to individuals prevents learning, damages trust, and discourages honest reporting of mistakes.

Blameless Language

Blameless postmortem writing means focusing on system and process failures rather than individual mistakes. The key technique is using passive constructions and systemic framing: Instead of "John deployed without running the tests," write "The deployment was not blocked by the CI pipeline despite failing tests — the pipeline was misconfigured to allow manual overrides." Instead of "The on-call engineer missed the alert," write "The alert threshold was set too high to catch the degradation in time — the alert fired 12 minutes after the impact began."

This is not dishonesty — it is redirecting attention to the system properties that made the failure possible. The question a blameless postmortem asks is not "who did this?" but "why was it possible for this to happen?" Key phrases: "The system allowed X to occur because..." / "The safeguard that should have prevented X was not in place because..." / "Contributing factor: the runbook did not cover the scenario where..."

Postmortem Structure and Language

A standard postmortem includes: Summary (one paragraph, plain English, no jargon — written for a non-technical reader), Impact (who was affected, for how long, at what scale), Timeline (chronological events with UTC timestamps, passive or active voice, factual), Root Cause Analysis (the systemic "why"), and Action Items (specific, assigned, with due dates). Timeline entries: "14:32 UTC — Deployment v2.1.4 pushed to production." / "14:35 UTC — Error rate on /checkout rises to 4%." / "14:41 UTC — First customer report received via support." Root cause framing: "Root cause: a missing database index caused query latency to increase by 300% under production load. The index was not added as part of the migration because the performance testing environment did not have a representative data volume."
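Timeline notes often arrive out of order and in different local time zones. A minimal sketch of normalising them to sorted UTC entries follows; the date, the fixed CET offset, and the function name are hypothetical, and the entry texts reuse the examples above.

```python
from datetime import datetime, timedelta, timezone

# Fixed +01:00 offset used purely for illustration (hypothetical reporter in CET).
CET = timezone(timedelta(hours=1))

# Hypothetical raw notes: (timezone-aware timestamp, description). The date is made up.
events = [
    (datetime(2024, 1, 15, 15, 41, tzinfo=CET), "First customer report received via support."),
    (datetime(2024, 1, 15, 14, 32, tzinfo=timezone.utc), "Deployment v2.1.4 pushed to production."),
    (datetime(2024, 1, 15, 14, 35, tzinfo=timezone.utc), "Error rate on /checkout rises to 4%."),
]


def build_timeline(raw_events):
    """Convert every timestamp to UTC, sort chronologically, and format each entry."""
    in_utc = [(ts.astimezone(timezone.utc), text) for ts, text in raw_events]
    in_utc.sort(key=lambda item: item[0])
    return [f"{ts:%H:%M} UTC - {text}" for ts, text in in_utc]


for line in build_timeline(events):
    print(line)
```

The customer report logged at 15:41 CET appears as 14:41 UTC, after the deployment and the error spike, which is the order a reader needs to follow the causal chain.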

Section 3: CI/CD Pipeline Communication

CI/CD is the backbone of modern software delivery, and the language around it — in pull request descriptions, pipeline configuration comments, build failure notifications, deployment announcements, and release notes — is highly specific. Understanding and using this vocabulary correctly marks you as someone who is fluent in the culture of modern software delivery.

Pipeline Status and Failure Language

Pipelines "pass," "fail," "time out," or are "blocked." Specific stages "succeed" or "fail." When a pipeline fails, the language for reporting it is precise: "The build is failing on the linting stage — the new formatter config introduced trailing-comma errors in 14 files." / "Pipeline is blocked: the end-to-end tests are flaky — they have failed 3 times in a row on the network timeout scenario." / "Deployment is paused pending approval from the release manager."

Flaky tests are a specific category: tests that "pass intermittently," "fail non-deterministically," or are "environment-dependent." The standard vocabulary: "This test is flaky — it fails roughly 20% of the time due to a race condition in the test setup." / "I'm skipping this test and opening a ticket to fix the flakiness." / "The CI run failed due to an infrastructure issue, not a code problem — I'm re-running."
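A claim such as "it fails roughly 20% of the time" is more convincing when it is backed by a count. The sketch below derives that figure from a list of pass/fail results for one test on the same commit; the function names, the test name, and the run history are hypothetical.

```python
def failure_rate(results):
    """results: list of booleans (True = pass) for runs of one test on the same commit."""
    if not results:
        raise ValueError("no runs recorded")
    return results.count(False) / len(results)


def describe(test_name, results):
    """Turn a run history into a sentence suitable for a ticket or PR comment."""
    rate = failure_rate(results)
    failures = results.count(False)
    if failures == 0:
        return f"{test_name} is stable: 0 failures in {len(results)} runs."
    if failures == len(results):
        return f"{test_name} fails consistently: this looks like a real failure, not flakiness."
    return (
        f"{test_name} is flaky: it failed {failures} of {len(results)} runs "
        f"({rate:.0%}) on the same commit."
    )


# Hypothetical history: 10 runs, 2 failures.
runs = [True, True, False, True, True, True, True, False, True, True]
print(describe("test_checkout_timeout", runs))
```

Distinguishing "fails every time" from "fails sometimes" matters for the wording: only the second case justifies calling the test flaky.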

Deployment Announcements

Deployment announcements (in Slack, email, or a deployment log) follow a standard pattern: what is being deployed, to which environment, from which version to which version, at what time, who is deploying, and the rollback plan. "Deploying v2.3.1 to production at 15:00 UTC. Changes include [link to release notes]. Rollback plan: revert to v2.3.0 — rollback takes approximately 3 minutes. @team, please monitor the error dashboard for 15 minutes post-deploy."

Section 4: SLO, SLA & Error Budget Discussions

SLOs (Service Level Objectives), SLAs (Service Level Agreements), and error budgets are the core concepts of SRE's approach to reliability. The language around them is precise, numerical, and often used in both technical and business contexts — you need to be able to explain these concepts to engineers and to product managers and executives alike.

SLO and SLA Vocabulary

An SLI (Service Level Indicator) is the measurement: "Our SLI for availability is the percentage of requests that return a 2xx status code." An SLO is the target: "Our SLO is 99.9% availability over a rolling 30-day window." An SLA is the contractual commitment: "Our SLA guarantees 99.5% uptime — below that, customers receive service credits." The hierarchy: SLI (measurement) → SLO (internal target) → SLA (external commitment).

Key phrases: "We're currently at 99.93% against a 99.9% SLO, so we have consumed 70% of our error budget this month." / "The SLO for this endpoint is defined as p99 latency under 200ms." / "We need to decide whether to tighten the SLO or invest in reliability improvements to defend the current one." / "This incident burned through 3 days' worth of error budget."
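Before quoting a "percentage of error budget consumed", it helps to check the arithmetic: the budget is the share of requests the SLO allows to fail, and consumption is actual failures divided by that allowance. A minimal sketch, assuming a request-based availability SLI; the function name and the request counts are made up for illustration.

```python
def error_budget_consumed(total_requests, failed_requests, slo):
    """Fraction of the error budget used so far in the window (1.0 = fully spent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = total_requests * (1 - slo)
    return failed_requests / allowed_failures


# Hypothetical month so far: 1,000,000 requests, 700 of them failed.
total, failed, slo = 1_000_000, 700, 0.999
availability = 1 - failed / total
consumed = error_budget_consumed(total, failed, slo)
print(
    f"We're at {availability:.2%} against a {slo:.1%} SLO; "
    f"{consumed:.0%} of the error budget is consumed."
)
```

Note that any measured availability below the SLO itself means more than 100% of the budget is gone: 1,500 failures in the same window would be 99.85% availability and 150% consumption.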

Error Budget Language

Error budgets translate reliability targets into allowable downtime: "A 99.9% SLO over 30 days gives us a monthly error budget of 43.2 minutes." (A 30-day window is 43,200 minutes, and 0.1% of that is 43.2.) You "spend," "burn," or "consume" the error budget: "The deployment failures last week burned 40% of our monthly budget." When the budget is exhausted, you "freeze feature deployments" and focus on reliability work: "We've consumed 100% of the error budget, so engineering and product have agreed to freeze non-essential deployments for the rest of the month."
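The conversion from an SLO to allowable downtime is simple enough to check before stating a number in a meeting. A sketch, with a hypothetical function name: a 30-day window contains 43,200 minutes, so a 99.9% SLO allows 43.2 minutes (the 43.8-minute figure sometimes quoted assumes an average calendar month of about 30.44 days).

```python
def error_budget_minutes(slo, window_days):
    """Allowable downtime in minutes for a time-based availability SLO."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo)


budget = error_budget_minutes(0.999, 30)
burned = 0.40 * budget  # "burned 40% of our monthly budget"
print(f"Monthly budget: {budget:.1f} minutes")
print(f"Burned: {burned:.1f} minutes, remaining: {budget - burned:.1f} minutes")
```

Stating the window explicitly ("over 30 days", "over a rolling 28 days") avoids the most common source of disagreement about these figures.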

Section 5: Kubernetes & Cloud Operations English

Kubernetes and cloud operations generate a dense vocabulary of resource types, states, and operations. Being fluent in this vocabulary is essential for communicating with other engineers, writing runbooks, and describing issues in incident channels.

Kubernetes Resource and State Language

Kubernetes objects have specific state vocabulary: Pods are "Pending," "Running," "Succeeded," "Failed," or "Unknown." They "restart" (CrashLoopBackOff is the status shown when a container keeps crashing and Kubernetes waits progressively longer before each restart attempt). Deployments "roll out," "roll back," or are "paused." Services "expose" deployments. You "scale up" or "scale down" a deployment. Nodes are "Ready" or "NotReady"; a node that has been "cordoned" shows "SchedulingDisabled" and accepts no new pods. You "drain" a node before maintenance to evict the pods already running on it.

Practical communication: "The payment-service pods are stuck in CrashLoopBackOff; I'm checking the logs now." / "I'm cordoning node-3 for maintenance and will then drain it so its pods are rescheduled onto other nodes." / "The HPA is scaling the API pods up to 8 replicas due to increased traffic; CPU utilisation is at 78%." / "Rolling back the deployment: the new image has a misconfigured liveness probe."

Cloud Operations and Cost Language

Cloud operations involve "provisioning," "deprovisioning," "scaling," "right-sizing," "cost optimisation," and "reserved capacity." Resources are "over-provisioned" (too large for the workload) or "under-provisioned" (too small). You "right-size" instances based on observed utilisation. "Spot instances" and "preemptible instances" offer lower cost with the risk of termination. "Reserved instances" or "committed use discounts" reduce cost in exchange for a capacity commitment.

Section 6: Monitoring & Alerting Vocabulary

Monitoring and observability have their own rich vocabulary that appears constantly in DevOps and SRE communication — in runbooks, incident timelines, alert descriptions, dashboard documentation, and architecture discussions.

Observability: Logs, Metrics, Traces

The three pillars of observability each have specific vocabulary. Logs are "emitted," "structured" or "unstructured," "aggregated," and "shipped" to a log management platform. You "tail" logs in real time or "query" them historically. Metrics are "collected," "scraped" (Prometheus-style), "aggregated," and "visualised." You set "thresholds" and define "anomaly detection" rules. Distributed traces "propagate" across services via "trace context headers" and reveal "latency bottlenecks" across service boundaries.

Key communication patterns: "I'm seeing elevated error rates in the logs — filtering for 5xx responses shows a spike starting at 14:35." / "The latency p99 metric has been trending upwards for the past hour — it crossed our alert threshold of 500ms at 14:42." / "The trace shows the latency is coming from the database query in the user-service, not the API gateway."
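When you say "p99 latency crossed the threshold", it is worth being clear about what p99 means: the value that 99% of requests are at or below. The sketch below uses the nearest-rank method, which is one of several percentile definitions (monitoring systems often interpolate or estimate from histograms instead); the sample data and names are hypothetical.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct (1-100).

    Returns the smallest sample such that at least pct% of samples are <= it.
    """
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be an integer between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


# Hypothetical latencies in ms: most requests are fast, a few are very slow.
latencies_ms = [120] * 197 + [650] * 3
ALERT_THRESHOLD_MS = 500  # threshold quoted in the example phrase above

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
if p99 > ALERT_THRESHOLD_MS:
    print(f"p99 latency is {p99}ms against an alert threshold of {ALERT_THRESHOLD_MS}ms "
          f"(median is {p50}ms).")
```

The example shows why the median alone is misleading in an incident update: the typical request is fine while the slowest 1.5% are well over the threshold.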

Alert Design Language

Good alert descriptions communicate what the alert means and what to do. Alert titles should be noun phrases: "High Error Rate — Payment Service" not "ALERT ALERT ALERT." Alert descriptions should include: what is happening, the current value vs. threshold, which users are affected, and a link to the runbook. "The payment service error rate has exceeded 1% (current: 2.3%). This may indicate a downstream dependency issue or a failed deployment. Investigate: [dashboard link]. Runbook: [link]."
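As a rough illustration of those elements, the sketch below builds a title and description only when the threshold is exceeded. The function name, the argument names, and the exact title format are assumptions; the links are left as the same placeholders used in the text.

```python
from typing import Optional


def render_alert(service: str, metric: str, current_pct: float, threshold_pct: float,
                 dashboard: str, runbook: str) -> Optional[dict]:
    """Return a title and description when the threshold is exceeded, else None."""
    if current_pct <= threshold_pct:
        return None
    # Title is a noun phrase naming the condition and the service.
    title = f"High {metric.title()}: {service.title()} Service"
    description = (
        f"The {service} service {metric} has exceeded {threshold_pct:g}% "
        f"(current: {current_pct:g}%). "
        f"Investigate: {dashboard}. Runbook: {runbook}."
    )
    return {"title": title, "description": description}


alert = render_alert("payment", "error rate", 2.3, 1.0,
                     dashboard="[dashboard link]", runbook="[link]")
if alert is not None:
    print(alert["title"])
    print(alert["description"])
```

Putting the current value next to the threshold lets the reader judge severity at a glance, before opening any dashboard.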

Section 7: On-Call & Runbook Language

On-call communication and runbook writing represent two ends of a spectrum: runbooks are written carefully and in advance, to be used under pressure; on-call communication happens in real time, often when tired and stressed. Both require clarity, but in different ways.

Runbook Writing Language

Runbooks should be written for a version of yourself who is woken up at 3am and has not worked with this system for six months. Use numbered steps with imperative mood: "1. Check the dashboard: [link]. 2. If the error rate exceeds 5%, page the on-call backend engineer. 3. If the database CPU is above 90%, run: kubectl exec..." Each step should have a single clear action. Avoid "you might want to consider" — write "do X." Include decision branches: "If X is true, proceed to step 5. If X is false, proceed to step 8."
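One way to test whether runbook steps are unambiguous is to try expressing them as code: if a step cannot be written as a condition and an action, it is probably too vague. The sketch below encodes the example thresholds from the steps above; the function name and the fallback action are assumptions.

```python
from typing import List


def runbook_actions(error_rate_pct: float, db_cpu_pct: float) -> List[str]:
    """Return the actions the example runbook prescribes for the observed readings."""
    actions = []
    # Step 2: if the error rate exceeds 5%, page the on-call backend engineer.
    if error_rate_pct > 5:
        actions.append("Page the on-call backend engineer.")
    # Step 3: if the database CPU is above 90%, run the command given in the runbook.
    if db_cpu_pct > 90:
        actions.append("Run the database diagnostic command from step 3.")
    # Assumed fallback: the source steps do not say what to do when nothing is crossed.
    if not actions:
        actions.append("No threshold crossed: keep monitoring the dashboard.")
    return actions


for action in runbook_actions(error_rate_pct=6.2, db_cpu_pct=94):
    print(action)
```

Every branch produces an imperative sentence with a single action, which is the same test a written runbook step should pass.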

Good runbooks include: the trigger condition (what alert or symptom leads here), the diagnostic steps (what to check and how to interpret what you see), the mitigation steps (what to do to stop the bleeding), the escalation path (who to call if these steps don't work), and the context (what this service does, what the failure mode means).

On-Call Handoff Language

Handoff messages between on-call rotations should cover: the current state of all active incidents or elevated concerns, any known flaky systems to watch, recent deployments that might be relevant, and any follow-up items. "Handoff: No active incidents. Watch: the payment-service has been intermittently showing elevated latency (p99 around 350ms against a 200ms SLO); investigation is in progress, ticket [link]. Recent deploys: user-service v1.4.2 deployed at 09:00 UTC, no issues observed. Action items: [link to Jira]."

Most Useful Vocabulary & Phrases for DevOps & SRE

SLO / SLA / SLI

'We've blown through our error budget: the SLO is 99.9% but we're currently at 99.7%.'

error budget

'This incident consumed 3 days of our monthly error budget.'

blameless postmortem

'We run blameless postmortems — the goal is to find systemic fixes, not to assign fault.'

flaky test

'The CI is failing due to a flaky test — I'm skipping it and opening a ticket.'

CrashLoopBackOff

'The pod is in CrashLoopBackOff — checking the container logs for the crash reason.'

cordon and drain

'I'm cordoning node-3 and draining it before the maintenance window.'

p99 latency

'Our p99 latency is 450ms against a 200ms SLO — this needs investigation.'

runbook

'There's a runbook for this scenario — I'll follow the steps in [link].'

canary deployment

'We're rolling out v2.1 as a canary to 5% of traffic first.'

rollback

'Initiating rollback to v2.0 — the error rate spiked to 8% after the deploy.'

toil

'This manual process is pure toil — I'm automating it so we never have to do it again.'

blast radius

'We're limiting the blast radius by deploying to one region at a time.'

immutable infrastructure

'We follow immutable infrastructure principles — servers are never patched in place.'

infrastructure drift

'I found infrastructure drift between staging and production — the security groups differ.'

golden path

'We provide a golden path for service deployment — teams can get a new service running in 10 minutes.'

four golden signals

'We monitor the four golden signals: latency, traffic, errors, and saturation.'

escalation path

'If the runbook steps don't resolve it, the escalation path goes to the platform team.'

mean time to recovery (MTTR)

'Our MTTR has improved from 45 minutes to 12 minutes since we added the automated rollback.'

availability zone

'The service is deployed across three availability zones for fault tolerance.'

breaking change

'This Terraform change is a breaking change — it requires manual state migration.'

Recommended Learning Path for DevOps & SRE Engineers

1. Incident Response Language
   Start with the highest-stakes communication context in DevOps/SRE: incident bridge communication, status updates, and severity escalation.
2. SLO Engineering Language
   SLOs, SLAs, error budgets, and reliability target vocabulary, essential for any SRE role.
3. CI/CD Pipeline Language
   Pipeline status communication, deployment announcements, and release management vocabulary.
4. Kubernetes Operations Language
   Resource state vocabulary, operational communication, and troubleshooting language for Kubernetes environments.
5. Observability Engineering Language
   Logs, metrics, traces, and alert vocabulary: the language of monitoring systems at scale.
6. Terraform & IaC Operations
   Infrastructure-as-code vocabulary for Terraform, Ansible, and related tools.
7. Post-Incident Facilitation
   Blameless postmortem writing, facilitation language, and action item tracking vocabulary.
8. DevOps Engineer Interview Questions
   Practice for DevOps and SRE technical interviews, covering both technical and behavioural questions.
9. SRE Interview Questions
   SRE-specific interview preparation covering reliability engineering principles and on-call scenarios.

Exercise Sets for DevOps & SRE Engineers

Practise the vocabulary and communication patterns covered in this guide with these focused exercise sets:

Vocabulary exercises

DevOps & Cloud Vocabulary — containers, blue-green deployment, IaC, SLOs, observability
Kubernetes Deep Dive Vocabulary — Pod lifecycle, RBAC, Service types, storage
Cloud Architecture & FinOps Vocabulary — cloud cost management and optimisation

Code reading & collocations

Code Reading & Description exercises — read Dockerfiles, Kubernetes manifests, and config files
IT Collocations exercises — spin up, tear down, roll back, provision, scale — DevOps verb patterns

Interview preparation

Technical Interview exercises — STAR method, system design language, behavioural questions
DevOps Engineer Interview Questions — role-specific practice
SRE Interview Questions — reliability engineering interview preparation