3 exercises — master the essential metrics and terms every on-call engineer needs: SLI/SLO/SLA, MTTR/MTTD, and severity levels.
1 / 3
A product manager asks: "What's the difference between an SLA, SLO, and SLI?" Which definition set is correct?
Option A is the correct industry-standard definition.
SLI (Service Level Indicator): the raw metric you measure. Examples: request success rate, p99 latency, error rate, availability %.
SLO (Service Level Objective): your internal target for the SLI. Example: "99.9% of requests must succeed over a rolling 30-day window." SLOs are set by engineering teams and are internal targets rather than contractual promises. They are how you decide whether your service is "healthy".
SLA (Service Level Agreement): a legal/contractual commitment to a customer. If you breach an SLA, there are consequences (refunds, escalation, etc.). SLAs are almost always set looser than your internal SLO (for example, a 99.5% SLA behind a 99.9% SLO) to give you a buffer.
Relationship: SLI is measured → compared against SLO → if the SLO is missed badly or repeatedly → SLA may be breached → penalties apply.
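The relationship above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the names WindowCounts, success_rate_sli and evaluate are hypothetical, the 99.9% SLO comes from the example above, and the 99.5% SLA is an assumed value chosen only to show the buffer.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts over one measurement window (e.g. a rolling 30 days)."""
    total: int
    successful: int


def success_rate_sli(counts: WindowCounts) -> float:
    """SLI: fraction of requests that succeeded, between 0 and 1."""
    if counts.total == 0:
        return 1.0  # assumption: a window with no traffic counts as healthy
    return counts.successful / counts.total


def evaluate(counts: WindowCounts, slo: float, sla: float) -> str:
    """Compare the measured SLI with the internal SLO and the contractual SLA."""
    sli = success_rate_sli(counts)
    if sli < sla:
        return "SLA breached"
    if sli < slo:
        return "SLO breached, SLA still met"
    return "healthy"


SLO_TARGET = 0.999  # from the example above: 99.9% of requests succeed
SLA_TARGET = 0.995  # hypothetical contractual floor, looser than the SLO

print(evaluate(WindowCounts(1_000_000, 999_500), SLO_TARGET, SLA_TARGET))  # healthy
print(evaluate(WindowCounts(1_000_000, 998_000), SLO_TARGET, SLA_TARGET))  # SLO breached, SLA still met
print(evaluate(WindowCounts(1_000_000, 990_000), SLO_TARGET, SLA_TARGET))  # SLA breached
```

The middle case is the buffer at work: the internal target is missed, so the team should react, but no contractual penalty applies yet.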
2 / 3
In a post-mortem, a colleague writes: "MTTR was 52 minutes. MTTD was 28 minutes." What do these terms mean?
Option B is correct. These are the standard incident reliability metrics:
MTTD (Mean Time To Detect): the average time between when a problem first occurs and when your monitoring/alerting first detects it. In this example: 28 minutes passed before the alert fired. A high MTTD usually points to gaps in monitoring or alerting.
MTTR (Mean Time To Recovery): the average time from when the incident started to when service was fully restored. In this example: 52 minutes total duration, which includes the 28 minutes before detection. (Some teams expand MTTR as Mean Time To Repair, Respond, or Resolve; these variants can start or stop the clock at different points, so confirm which definition your team uses.)
MTBF (Mean Time Between Failures, not shown here): the average time a system runs between one failure and the next. A high MTBF means your system is stable.
These three metrics are core to SRE dashboards. Reducing MTTD (better alerting) and MTTR (better runbooks, automation) are common action items in post-mortems.
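As a rough sketch of how MTTD and MTTR could be computed from incident timestamps, assuming both clocks start when the incident starts (as defined above). The Incident record and helper names are hypothetical; the first incident mirrors the 28/52-minute post-mortem, and the second is invented purely to show the averaging.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the problem first occurred
    detected: datetime  # when monitoring/alerting first fired
    resolved: datetime  # when service was fully restored


def mean_minutes(durations: list[timedelta]) -> float:
    """Average a list of durations and return the result in minutes."""
    return sum(d.total_seconds() for d in durations) / len(durations) / 60


def mttd(incidents: list[Incident]) -> float:
    """Mean Time To Detect: start of incident to first detection."""
    return mean_minutes([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> float:
    """Mean Time To Recovery: start of incident to full restoration."""
    return mean_minutes([i.resolved - i.started for i in incidents])


t0 = datetime(2024, 1, 10, 9, 0)   # arbitrary illustrative timestamps
t1 = datetime(2024, 2, 3, 14, 0)
incidents = [
    # The post-mortem example: detected after 28 min, restored after 52 min.
    Incident(t0, t0 + timedelta(minutes=28), t0 + timedelta(minutes=52)),
    # A second, made-up incident: detected after 12 min, restored after 40 min.
    Incident(t1, t1 + timedelta(minutes=12), t1 + timedelta(minutes=40)),
]

print(f"MTTD: {mttd(incidents):.1f} min")  # MTTD: 20.0 min
print(f"MTTR: {mttr(incidents):.1f} min")  # MTTR: 46.0 min
```

A single post-mortem reports one incident's detection and recovery times; the "mean" only becomes meaningful once several incidents are averaged, as here.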
3 / 3
Your team classifies incidents by severity. Which definition of P1 and P0 is most accurate for most engineering teams?
Option B matches the most common convention. Severity levels differ between organisations, but the usual pattern is:
P0 / SEV-0: catastrophic. Complete platform down, critical data breach, or payment processing failed entirely. All-hands war room; executives (CEO/CTO) are often notified.
P1 / SEV-1: critical. Major feature down for a significant percentage of users, SLA likely breached, on-call team mobilised immediately. Example: checkout broken for 30% of users.
P2 / SEV-2: major. Significant but partial impairment. A response during business hours is acceptable. Example: image uploads running 3× slower than normal.
P3 / SEV-3 — minor. Small impact, workaround available. Schedule fix in next sprint.
Important: P0 is more severe than P1. Lower number = more critical. This can surprise people who expect a bigger number to mean a bigger problem. Always document your organisation's severity matrix in the runbook.
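One way to make the "lower number = more critical" rule hard to get wrong in tooling is to encode severities as an ordered enum. The sketch below is illustrative only: the Severity class and both helpers are hypothetical names, and the paging rule (P0 and P1 page immediately) is an assumption drawn from the descriptions above, not a universal policy.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Incident severity; a LOWER value means a MORE critical incident."""
    P0 = 0  # catastrophic: complete platform down
    P1 = 1  # critical: major feature down for many users
    P2 = 2  # major: significant but partial impairment
    P3 = 3  # minor: small impact, workaround available


def more_severe(a: Severity, b: Severity) -> Severity:
    """Return the more critical of two severities (the smaller number)."""
    return min(a, b)


def pages_immediately(sev: Severity) -> bool:
    """Assumed policy: P0 and P1 page on-call at once; P2/P3 wait for business hours."""
    return sev <= Severity.P1


print(more_severe(Severity.P2, Severity.P0).name)  # P0
print(pages_immediately(Severity.P1))              # True
print(pages_immediately(Severity.P2))              # False
```

Because the enum is ordered, sorting a list of incidents by severity puts the most critical ones first without any custom comparison logic.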