Reliability Communication Language
5 exercises — Practice communicating reliability posture to leadership, enterprise customers, and product teams: framing metrics, transparency, and outlier reporting.
Quick reference: Reliability communication principles
- Add scale context — percentages mean more with request counts
- Surface outliers — don't hide below-SLO services in aggregates
- Use measured data — never state aspirational targets as facts
- Reframe from capability to business impact — what does downtime cost?
1 / 5
An SRE is presenting the monthly reliability summary to the VP of Engineering. They say: "The checkout service maintained 99.94% availability — consuming 28% of our error budget — across 2.3 billion requests this month." Why is mentioning the request volume important in this context?
Contextualizing reliability metrics with scale makes abstract percentages meaningful.
"99.94% availability" sounds modest. But at 2.3 billion requests/month:
• 0.06% failure rate = ~1.38 million failed requests
• Achieving 99.94% means ~2.299 billion requests served correctly
Scale context turns an abstract percentage into concrete counts. The volume served correctly demonstrates engineering quality at a tangible scale, and the failure count shows how large the remaining impact still is.
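The arithmetic above can be reproduced with a short sketch. This is only an illustration under stated assumptions: the function name `scale_context` and the returned field names are hypothetical, and the availability is passed as a string so that `Decimal` avoids floating-point rounding on large request counts.

```python
from decimal import Decimal


def scale_context(availability_pct: str, total_requests: int) -> dict:
    """Convert an availability percentage into absolute request counts.

    Hypothetical helper for illustration; not part of any monitoring tool.
    """
    availability = Decimal(availability_pct) / Decimal(100)
    # Requests served correctly (the "availability numerator").
    good = int((availability * total_requests).to_integral_value())
    # Everything else counts as a failed request.
    failed = total_requests - good
    return {"total": total_requests, "good": good, "failed": failed}


# Figures from the exercise: 99.94% availability across 2.3 billion requests.
summary = scale_context("99.94", 2_300_000_000)
print(f"{summary['good']:,} served correctly, {summary['failed']:,} failed")
```

With the exercise's figures this yields 2,298,620,000 good requests and 1,380,000 failed ones, matching the rounded values quoted above.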
Other useful context:
• Traffic patterns (e.g., "including 3 major sale events")
• Comparison to industry benchmarks
• Trend over time ("up from 99.87% last month")
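Trend framing can also be made concrete with a small calculation. The sketch below is an assumption-laden illustration (the function `trend_framing` and its output keys are hypothetical): it reports the change in percentage points and the relative change in failure rate, which is often the more striking number.

```python
from decimal import Decimal


def trend_framing(previous_pct: str, current_pct: str) -> dict:
    """Compare two availability percentages (hypothetical helper)."""
    previous, current = Decimal(previous_pct), Decimal(current_pct)
    # Change in availability, in percentage points.
    delta_points = current - previous
    # Failure rates are the complement of availability.
    prev_fail = Decimal(100) - previous
    curr_fail = Decimal(100) - current
    # Relative reduction in failure rate; undefined if there were no failures before.
    if prev_fail == 0:
        reduction_pct = None
    else:
        reduction_pct = round((prev_fail - curr_fail) / prev_fail * 100, 1)
    return {"delta_points": delta_points, "failure_reduction_pct": reduction_pct}


# Figures from the exercise: up from 99.87% last month to 99.94% this month.
trend = trend_framing("99.87", "99.94")
print(trend)
```

For the exercise's numbers, availability rose by 0.07 percentage points, while the failure rate fell from 0.13% to 0.06%, a reduction of roughly 54%.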
Key vocabulary:
• Reliability at scale — framing availability in terms of absolute request counts
• Scale context — volume information that makes percentage metrics meaningful
• Availability numerator — the "good requests" count behind the percentage
• Trend framing — comparing current to previous period to show direction
"99.94% availability" sounds modest. But at 2.3 billion requests/month:
• 0.06% failure rate = ~1.38 million failed requests
• Achieving 99.94% means ~2.299 billion requests served correctly
Scale context transforms the number from an abstract percentage into a demonstration of engineering quality at a tangible scale.
Other useful context:
• Traffic patterns (e.g., "including 3 major sale events")
• Comparison to industry benchmarks
• Trend over time ("up from 99.87% last month")
Key vocabulary:
• Reliability at scale — framing availability in terms of absolute request counts
• Scale context — volume information that makes percentage metrics meaningful
• Availability numerator — the "good requests" count behind the percentage
• Trend framing — comparing current to previous period to show direction