5 exercises on writing actionable risk entries, phased rollout descriptions, rollback plans, monitoring thresholds, and open questions in a design doc.
Open questions: specific decision + options + cost + deadline + owner.
0 / 5 completed
1 / 5
You are writing a risk entry in a design doc. Which version is the most professional and actionable?
A professional risk entry must name the specific trigger, the blast radius, and a concrete mitigation with a monitoring hook.
Option C delivers all four: it names the operation (schema migration, orders table), quantifies the scale (78M rows, 90s lock), states the impact (checkout failures), names the tool (pt-online-schema-change), schedules the window (Sunday 02:00 UTC), and links to a dashboard. A reviewer reading this can immediately judge the risk and verify the mitigation plan.
Quantify where possible: row counts, duration estimates, error-rate thresholds.
Always name the observability hook: "monitor on the X dashboard", "alert fires if Y exceeds Z".
Option A is generic. Option B names the impact but offers no mitigation or timing. Option D is a label-only entry — useful in a risk register table but not as a prose risk statement.
2 / 5
Which sentence describes a phased rollout most precisely?
A phased rollout description must specify the percentage, the trigger to advance, the metric to watch, and the exit criteria for each phase.
Option C gives exact traffic percentages (1% → 10% → 100%), dates (Weeks 1–3), the advancement condition (Phase 1 metrics within SLO), the observability metrics (p99 latency, error rate), and a bake period before flag removal. Reviewers and on-call engineers can execute this plan without further clarification.
Standard phased rollout vocabulary: canary, traffic percentage, bake period, feature flag, dark launch, blue/green swap.
State the advancement gate explicitly: "if Phase N metrics remain within SLO".
The final phase should include a flag-removal or cutover step to avoid shadow debt.
Options A and B lack specifics. Option D defers the design work, which defeats the purpose of the design doc.
3 / 5
Which rollback plan is described most clearly and reliably?
A rollback plan must name the exact mechanism, the time estimate, who executes it, and any conditional path (e.g., if a migration has already run).
Option C does this: it names the mechanism (feature flag in LaunchDarkly), the time (<1 min), the benefit (no code deployment), and the conditional branch (if migration ran → compensating migration with a file name and runbook link). An on-call engineer at 3am can execute this without further research.
Feature flags are the fastest rollback path — mention them explicitly when available.
Always address the database migration case separately: schema changes are often harder to revert than code.
Link to a runbook or playbook for multi-step rollbacks: "see RUNBOOK-447".
Options A and B provide no actionable information. Option D outsources the plan to unspecified "standard procedures" that may not exist.
4 / 5
A design doc has this monitoring section: "We will monitor the system after launch." Which rewrite makes it actionable?
A monitoring plan must name the specific signals, their alert thresholds, the duration before alerting, and where alerts route.
Option C specifies three named metrics (checkout.success_rate, order_creation.p99_latency, kafka.consumer_lag), each with a threshold value and a sustained-duration condition (avoids flapping). Alerts route to a named channel (#incidents-checkout) and service (PagerDuty). The 72-hour heightened coverage provision shows operational maturity.
Always pair a threshold with a duration: ">X for >Y minutes" prevents alert noise from transient spikes.
Name the consumer-lag metric if your design involves a message queue — this is the earliest signal of downstream failure.
State the alert routing: "fires to Slack channel", "pages PagerDuty policy", "creates a JIRA ticket".
Options A, B, and D defer specifics to after deployment. That is too late — monitoring must be agreed as part of the design, not retrofitted.
5 / 5
You are writing the open questions at the end of a design doc's risk section. Which entry is most useful to reviewers?
An open question must name the specific decision, the options under consideration, the cost or consequence of each, the deadline, and the owner.
Option C does all of this: the question is specific (Kafka retention policy), the current proposal is stated (7 days), the trigger for change is explained (expanding event sourcing, ARCH-88), the cost implication is quantified ($1,200/month), the decision deadline is hard and dated, and the owner is named. A reviewer can immediately see what is blocking and escalate if needed.
Open questions are not placeholders — they are design decisions that need external input.
Always state why the question is open: what new information or decision would resolve it?
Include a deadline and owner: without these, open questions linger forever.
Link related issues or RFCs where they exist: "tracked in ARCH-88".
Options A, B, and D are placeholders that add no value. Option D in particular ("TBD") signals an unfinished doc.