Advanced · 6 topic areas · 64+ exercises

SRE / Platform Engineer

SREs are communication hubs during incidents, reliability reviews, and on-call handoffs. This path covers the precise language of error budgets, incident timelines, post-mortem facilitation, and reliability reporting.

Topics covered

  • SLO/SLI/SLA
  • Error budgets
  • Incident response
  • Post-mortem writing
  • Runbooks
  • Chaos engineering

Vocabulary spotlight

4 terms every SRE / Platform Engineer should know in English:

error budget n.

The maximum amount of unreliability permitted by an SLO over a given period

"We've burned 70% of this quarter's error budget after Monday's incident."

SLO n.

Service Level Objective: a target value for a reliability metric such as availability

"Our SLO is 99.9% availability, measured over a rolling 30-day window."

blameless post-mortem n.

An incident retrospective focused on systemic causes, not individual fault

"The blameless post-mortem revealed five contributing factors."

chaos engineering n.

Intentionally injecting faults into a system to verify resilience

"We use chaos engineering to validate that our circuit breakers actually work."
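
The two example sentences about error budgets and SLOs above rest on simple arithmetic, and it helps to be able to say the numbers out loud. The sketch below assumes an availability SLO expressed as a fraction and a budget counted in minutes of downtime; the function names `error_budget` and `budget_burned` are illustrative, not taken from any particular tool.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO permits over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return window * (1.0 - slo)


def budget_burned(downtime: timedelta, slo: float, window: timedelta) -> float:
    """Return the fraction of the error budget consumed by the downtime."""
    return downtime / error_budget(slo, window)


# A 99.9% SLO over a rolling 30-day window allows about 43.2 minutes of downtime.
budget = error_budget(0.999, timedelta(days=30))
print(round(budget.total_seconds() / 60, 1))  # 43.2

# Half of that downtime burns half of the budget.
print(budget_burned(timedelta(seconds=1296), 0.999, timedelta(days=30)))  # 0.5
```

With these numbers, "we've burned 70% of the budget" would correspond to roughly 30 minutes of downtime in the window.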

📚 Vocabulary Reference

Key terms organised by category for SRE / Platform Engineers:

Reliability Metrics

SLO, SLI, SLA, error budget, burn rate, alert threshold, MTTR, MTBF, availability, uptime, five nines

Incident Management

incident, SEV-1, SEV-2, on-call, page, escalation, incident commander, responder, mitigation, resolution, blameless post-mortem

Observability

metric, log, trace, span, distributed tracing, cardinality, p95, p99, dashboard, alert, runbook, playbook

Reliability Concepts

toil, automation, capacity planning, load shedding, graceful degradation, circuit breaker, retry with backoff, chaos engineering, fault injection

Infrastructure

Kubernetes, Prometheus, Grafana, PagerDuty, Datadog, cluster, node pool, resource limit, taint, toleration

Deployment Safety

canary, blue-green, rolling update, rollback, change freeze, deployment gate, smoke test, readiness probe, liveness probe

Recommended exercises

Real-world scenarios you'll practise

  • Writing a blameless post-mortem after a SEV-1 incident
  • Presenting error budget burn rate to engineering leadership
  • Facilitating a live incident call with multiple teams
  • Drafting an SLO proposal for a new service
  • Writing a SEV-1 customer-facing status page update — honest, calm, no jargon, regular cadence
  • Explaining toil reduction to management — justifying automation investment in business terms
  • Writing a capacity planning proposal — current usage, projections, recommended provisioning, cost estimate
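
For the burn-rate scenario in the list above, it helps to have one concrete calculation to talk through. The sketch below assumes one common convention: burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The function names and the sample figures are illustrative assumptions.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = 1.0 - slo
    if allowed <= 0.0:
        raise ValueError("slo must be below 1.0")
    return error_ratio / allowed


def days_until_exhausted(rate: float, window_days: float,
                         budget_left: float = 1.0) -> float:
    """Days until the remaining budget fraction is spent at a constant rate."""
    if rate <= 0.0:
        return float("inf")  # nothing is being burned
    return window_days * budget_left / rate


# 0.5% of requests failing against a 99.9% SLO burns budget about 5x too fast.
rate = burn_rate(0.005, 0.999)
print(round(rate, 2))

# With 30% of a 30-day budget left, that pace leaves under two days.
print(round(days_until_exhausted(rate, 30, budget_left=0.3), 1))
```

For leadership, the second figure is usually the more useful one: "at the current pace the budget runs out in under two days" is easier to act on than a raw ratio.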

🎯 Interview questions specific to this role

Practise answering these questions out loud — or in writing. Each question targets a real interviewer concern for SRE / Platform Engineers.

  1. How do you define and enforce SLOs for a new service?
  2. Walk me through how you would handle a SEV-1 incident from alert to post-mortem.
  3. How do you balance feature velocity with reliability work?
  4. What is an error budget and how have you used one in practice?
  5. How do you communicate reliability metrics to non-technical stakeholders?

