Advanced · 6 topic areas · 25+ exercises

SRE Platform Engineer

SRE Platform Engineers are "SREs for SREs" — they build the reliability infrastructure that other engineers use: SLO platforms, alerting frameworks, incident automation, and observability tooling. Their English work includes writing platform documentation for service teams, presenting reliability platform roadmaps to engineering leadership, and writing SLO definition guides. This path covers the vocabulary of reliability engineering platforms and tooling.

Topics covered

  • SLO platform engineering
  • Alerting-as-code
  • Runbook automation
  • Observability platform
  • Incident tooling
  • Reliability metrics

Vocabulary spotlight

4 terms every SRE Platform Engineer should know in English:

error budget n.

The amount of unreliability an SLO allows, derived from the SLO target (e.g., a 99.9% availability target leaves a 0.1% error budget). When the budget is exhausted, reliability work takes priority over new features.

"We burned through 80% of the monthly error budget in the first week, triggering a freeze on new deployments."

alerting-as-code n.

Managing alert rules, routing, and escalation policies in version-controlled code rather than through a UI, which enables review, testing, and rollback of alerting changes.

"We converted all PagerDuty routing rules to alerting-as-code using Terraform, eliminating configuration drift between environments."

runbook automation n.

Automating the steps in a runbook so that common operational responses (drain traffic, restart service, scale up) can be triggered by on-call engineers with a single command, or automatically when an alert fires.

"Our runbook automation reduced mean time to mitigate for the top 10 alert types by 60%."

toil n.

Operational work that is manual, repetitive, tactical, and automatable. SRE practice aims to keep toil below 50% of an engineer's time.

"We measured that manual certificate rotation was generating 4 hours of toil per engineer per week and prioritised automating it."

📚 Vocabulary Reference

Key terms organised by category for SRE Platform Engineers:

SLO Platform

SLO, SLI, SLA, error budget, error budget policy, burn rate, burn rate alert, fast burn, slow burn, multi-window alert

Alerting & Runbooks

alerting-as-code, alert rule, routing rule, escalation policy, runbook, runbook automation, runbook link, playbook, on-call schedule, PagerDuty policy

Observability Platform

OpenTelemetry, metrics pipeline, trace backend, log aggregation, exemplar, cardinality, scrape config, recording rule, dashboard template, service catalog

Reliability Culture

toil, toil budget, reliability target, blameless postmortem, change failure rate, MTTR, MTTD, DORA metrics, production readiness review, launch checklist

Recommended exercises

Real-world scenarios you'll practise

  • Writing an SLO definition guide for service teams: explaining error budget calculation, burn rate alerting, and when to freeze deployments
  • Presenting the reliability platform roadmap to engineering leadership: connecting platform capabilities to the engineering KPIs of MTTR, change failure rate, and toil reduction
  • Writing runbook automation documentation: explaining how on-call engineers should use, contribute to, and test automated runbooks
  • Running a reliability platform onboarding workshop: teaching service teams to define SLOs, write alerts-as-code, and interpret error budget dashboards
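
For the SLO definition guide in the first scenario, it can help to show readers how a burn rate alert is evaluated. The sketch below is one common formulation and rests on stated assumptions: burn rate is taken as the observed error ratio divided by the error budget fraction, the page fires only when both a long and a short window exceed the threshold, and the default threshold of 14.4 corresponds to spending 2% of a 30-day budget in one hour. The function names and sample error ratios are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Multi-window check: both windows must burn at or above the threshold."""
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


# 99.9% SLO with 2% errors over the last hour and 3% over the last 5 minutes:
# both windows burn far above the threshold, so this is a fast burn.
print(should_page(0.02, 0.03, 0.999))

# Same long window, but the short window has recovered to 0.05% errors:
# the incident is already over, so the alert does not fire.
print(should_page(0.02, 0.0005, 0.999))
```

The short window is what stops the alert from paging after the problem has already been mitigated, which is a useful point to make explicit when explaining multi-window alerts to service teams.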
