SRE Platform Engineer
SRE Platform Engineers are "SREs for SREs" — they build the reliability infrastructure that other engineers use: SLO platforms, alerting frameworks, incident automation, and observability tooling. Their English work includes writing platform documentation for service teams, presenting reliability platform roadmaps to engineering leadership, and writing SLO definition guides. This path covers the vocabulary of reliability engineering platforms and tooling.
Topics covered
- SLO platform engineering
- Alerting-as-code
- Runbook automation
- Observability platform
- Incident tooling
- Reliability metrics
Vocabulary spotlight
Four terms every SRE Platform Engineer should know in English:
Error budget
The amount of unreliability an SLO allows, derived from the SLO target (e.g., a 99.9% availability SLO leaves a 0.1% error budget). When the budget is exhausted, reliability work takes priority over new features.
"We burned through 80% of the monthly error budget in the first week, triggering a freeze on new deployments."
Alerting-as-code
Managing alert rules, routing, and escalation policies in version-controlled code rather than through a UI, which enables review, testing, and rollback of alerting changes.
"We converted all PagerDuty routing rules to alerting-as-code using Terraform, eliminating configuration drift between environments."
Runbook automation
Automating the steps in a runbook so that common operational responses (drain traffic, restart service, scale up) can be triggered by on-call engineers with a single command or automatically on alert.
"Our runbook automation reduced mean time to mitigate for the top 10 alert types by 60%."
Toil
Operational work that is manual, repetitive, tactical, and automatable. SRE practice aims to keep toil below 50% of an engineer's time.
"We measured that manual certificate rotation was generating 4 hours of toil per engineer per week and prioritised automating it."
📚 Vocabulary Reference
Key terms organised by category for SRE Platform Engineers:
- SLO Platform
- Alerting & Runbooks
- Observability Platform
- Reliability Culture
Recommended exercises
Real-world scenarios you'll practise
- Writing an SLO definition guide for service teams: explaining error budget calculation, burn-rate alerting, and when to freeze deployments
- Presenting the reliability platform roadmap to engineering leadership: connecting platform capabilities to the engineering KPIs of MTTR, change failure rate, and toil reduction
- Writing runbook automation documentation: explaining how on-call engineers should use, contribute to, and test automated runbooks
- Running a reliability platform onboarding workshop: teaching service teams to define SLOs, write alerts-as-code, and interpret error budget dashboards