Learn the vocabulary of treating an internal developer platform as a reliability-critical product.
0 / 5 completed
1 / 5
At standup, a dev mentions treating the internal developer platform itself as a product with its own uptime target and on-call rotation, since every application team's own reliability now depends on it staying up. What is this practice called?
Platform reliability engineering treats the internal developer platform itself as a product with its own uptime target and on-call rotation, recognizing that every application team's own reliability now depends on that platform staying up. Treating the platform as a best-effort side project ignores that its outage cascades into an outage for every team building on top of it. This formal reliability discipline reflects the platform's role as genuinely critical, shared infrastructure rather than an optional convenience layer.
2 / 5
During a design review, the team wants a clear service-level objective defined specifically for the platform's own provisioning API, separate from the reliability target of any individual application built on top of it. Which capability supports this?
A dedicated SLO for the platform's own core services, like its provisioning API, sets a clear, measurable reliability target for that shared infrastructure specifically, separate from any individual application team's own SLO. Relying only on each application's own SLO ignores that the platform sits underneath all of them and needs its own tracked target. This dedicated SLO makes the platform team accountable for the reliability of the specific thing every other team actually depends on.
3 / 5
In a code review, a dev notices the platform team runs a regular game day exercise simulating a platform-level outage, verifying every dependent application team knows how to detect and respond to it. What does this represent?
Cross-team game day exercises simulate a platform-level failure and verify that every dependent application team actually knows how to detect and respond to it, rather than assuming that knowledge exists untested. Assuming without testing risks discovering, during a real outage, that a dependent team had no real plan for this scenario. This practiced, cross-team exercise is essential given how many separate teams' reliability now cascades from a single shared platform's health.
4 / 5
An incident report shows a platform outage cascaded into a reliability failure for a dozen dependent application teams, several of whom had no documented plan for what to do when the shared platform itself went down. What practice would prevent this?
Requiring every dependent team to maintain and regularly test a documented fallback plan for a platform-level outage ensures they aren't improvising for the first time during an actual incident. Assuming teams will improvise effectively with no documented plan required is exactly what left a dozen teams unprepared in this incident. This required, tested fallback planning is a necessary complement to the platform team's own reliability engineering, since a shared platform failure realistically can still happen despite that engineering effort.
5 / 5
During a PR review, a teammate asks why the platform team maintains a dedicated SLO and on-call rotation for the platform itself instead of treating it as informal shared infrastructure with no separate reliability commitment. What is the reasoning?
Every dependent application team's own reliability now cascades directly from the shared platform's health, since their services can't function correctly if the platform underneath them is down. This makes the platform itself a critical production service in its own right, deserving the same formal SLO and on-call commitment any other critical service would get. The tradeoff is the real operational cost of staffing and maintaining that dedicated on-call rotation for the platform team.