Writing A/B Experiment Result Summaries
Practice writing clear experiment result summaries for engineering and product stakeholders.
A/B experiment summary structure
- Hypothesis: "We hypothesised that [intervention] would [expected change in metric] by [mechanism]."
- Significance: statistical (p-value, CI) vs. practical (business-scale translation)
- Groups: control + treatment described precisely, with traffic split and stratification
- Result language: "We observed a lift of X% in [metric] (treatment Y% vs. control Z%)"
- Next steps: ship decision + post-launch monitoring plan + archival
Which sentence correctly writes a hypothesis statement for an A/B experiment?
"We hypothesised that..." is the standard opening for an experiment hypothesis — it must name the intervention, the expected metric impact, and the underlying mechanism.
- Intervention named — "replacing the multi-step checkout form with a single-page layout".
- Expected metric impact — "increase the checkout completion rate".
- Mechanism named — "by reducing form abandonment at step 3" — a hypothesis is stronger when it explains why the change is expected to work, not just that it will work.
- Baseline cited — "41% drop-off at step 3" — anchors the hypothesis in current data, showing why this intervention was prioritised.
Which sentence correctly uses "statistically significant" vs. "practically significant" vocabulary?
Statistical and practical significance are distinct and must be reported separately.
- Statistically significant — "p = 0.003, 95% CI: +2.1% to +4.7%" — a difference this large would be unlikely to arise from random variation alone if the change had no real effect. The confidence interval gives the plausible range of the true effect, and here it excludes zero.
- Practically significant — "3.4% lift translates to approximately 280 additional completed checkouts per week" — translates the abstract percentage into a business outcome.
- A result can be statistically significant but practically negligible (for example, a very large sample can detect a 0.1% lift).
- A result can be practically significant but not statistically significant (for example, when there is not enough data to rule out chance).
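The two cases in the list above can be made concrete with a small calculation. The following is a minimal sketch of a two-proportion z-test using only the Python standard library. The function name `summarise_ab` and every count in it are hypothetical, chosen to illustrate the two cases; they are not taken from any real experiment, and the normal approximation it relies on assumes reasonably large samples.

```python
import math

def summarise_ab(conv_control, n_control, conv_treatment, n_treatment):
    """Two-sided two-proportion z-test plus a 95% CI for the absolute difference.

    Hypothetical helper for illustration; uses the normal approximation.
    """
    p_c = conv_control / n_control
    p_t = conv_treatment / n_treatment
    diff = p_t - p_c

    # Pooled standard error for the hypothesis test (null: no difference).
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    se_test = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = diff / se_test
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

    # Unpooled standard error for the confidence interval.
    se_ci = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    ci = (diff - 1.96 * se_ci, diff + 1.96 * se_ci)
    return {"diff": diff, "p_value": p_value, "ci95": ci}

# Made-up case 1: huge sample, tiny lift (50.0% vs. 50.1%, 5,000,000 users per group).
large = summarise_ab(2_500_000, 5_000_000, 2_505_000, 5_000_000)

# Made-up case 2: small sample, large lift (50% vs. 56%, 200 users per group).
small = summarise_ab(100, 200, 112, 200)

for name, result in [("large sample", large), ("small sample", small)]:
    verdict = "significant" if result["p_value"] < 0.05 else "not significant"
    print(f"{name}: diff={result['diff']:+.3%}, p={result['p_value']:.4f} ({verdict})")
```

In the first case the test clears the 0.05 threshold even though the lift is only a tenth of a percentage point. In the second, a six-point lift does not clear it, and its confidence interval includes zero. Neither number says whether the change is worth shipping; that requires the business-scale translation.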
Which passage correctly describes the treatment and control groups in an experiment summary?
An experiment description must name both groups with their traffic split, exactly what each group experienced, and how randomisation was controlled.
- Traffic split stated — "50% of traffic" for each — stakeholders need to know the exposure to assess statistical power.
- Each group described precisely — "existing three-step checkout form" vs. "redesigned single-page checkout" — not just "old" and "new".
- Randomisation method — "randomly assigned, stratified by device type" — stratification is important when device type is a confounding variable (mobile users behave differently from desktop users). Without stratification, an unequal split of mobile users could confound the results.
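One way to picture stratified assignment is the sketch below, which shuffles users within each device type and then alternates between groups so that every stratum splits 50/50. The function `stratified_assign`, the user IDs, and the population sizes are all hypothetical. Real experimentation platforms usually assign users as they arrive, often by hashing the user ID, so treat this as an illustration of the idea and not as a production method.

```python
import random
from collections import defaultdict

def stratified_assign(users, seed=0):
    """Assign users to 'control' or 'treatment' 50/50 within each device stratum.

    `users` is a list of (user_id, device_type) pairs. Hypothetical sketch.
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    by_device = defaultdict(list)
    for user_id, device in users:
        by_device[device].append(user_id)

    assignment = {}
    for device, ids in by_device.items():
        rng.shuffle(ids)
        # Alternate through the shuffled list so each stratum splits evenly.
        for i, user_id in enumerate(ids):
            assignment[user_id] = "control" if i % 2 == 0 else "treatment"
    return assignment

# Illustrative population: 60 mobile users and 40 desktop users.
users = [(f"m{i}", "mobile") for i in range(60)] + [(f"d{i}", "desktop") for i in range(40)]
assignment = stratified_assign(users)

mobile_treatment = sum(
    1 for uid, dev in users if dev == "mobile" and assignment[uid] == "treatment"
)
desktop_treatment = sum(
    1 for uid, dev in users if dev == "desktop" and assignment[uid] == "treatment"
)
print(mobile_treatment, desktop_treatment)  # 30 20
```

Because the split is enforced inside each stratum, both groups end up with the same share of mobile users, so device type cannot account for a difference between them.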
Which sentence correctly uses "we observed a lift of X% in metric Y" language?
Experiment result language must include: the lift magnitude, the metric name, both group values, the measurement window, and the sample size.
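- Lift magnitude — "3.4%" — state whether the figure is a relative change or an absolute change in percentage points. With the group values below, the absolute difference is 2.3 percentage points (68.1% − 65.8%) and the relative lift is about 3.5% (2.3 / 65.8), so the quoted 3.4% is closest to a relative lift, not an absolute one.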
- Metric name — "checkout completion rate" — not just "the metric".
- Both group values — "68.1% vs. 65.8%" — the absolute values let readers independently verify the arithmetic and assess baseline levels.
- Measurement window — "14-day exposure window" — experiment duration affects whether novelty effects have washed out.
- Sample size per variant — "42,000 users per variant" — lets readers assess statistical power independently.
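The arithmetic behind a result sentence is easy to check in code. The sketch below takes the group values, window, and sample size quoted in the list above and computes both the absolute difference (in percentage points) and the relative lift, then assembles a sentence containing all five elements. The helper `describe_lift` and the exact wording of the sentence are assumptions for illustration.

```python
def describe_lift(metric, treatment_rate, control_rate, window_days, n_per_variant):
    """Return the absolute lift, the relative lift, and a one-sentence summary.

    Rates are fractions (0.681 means 68.1%). Hypothetical helper for illustration.
    """
    absolute_pp = (treatment_rate - control_rate) * 100       # percentage points
    relative_pct = (treatment_rate / control_rate - 1) * 100  # percent of control
    sentence = (
        f"We observed a relative lift of {relative_pct:.1f}% in {metric} "
        f"(treatment {treatment_rate:.1%} vs. control {control_rate:.1%}, "
        f"an absolute difference of {absolute_pp:.1f} percentage points) "
        f"over a {window_days}-day exposure window with {n_per_variant:,} users per variant."
    )
    return absolute_pp, relative_pct, sentence

absolute_pp, relative_pct, sentence = describe_lift(
    "checkout completion rate", 0.681, 0.658, 14, 42_000
)
print(sentence)
```

For 68.1% vs. 65.8% this gives an absolute difference of 2.3 percentage points and a relative lift of about 3.5%. Labelling which of the two is being reported avoids the most common ambiguity in result summaries.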
Which passage correctly writes next steps after an experiment conclusion?
Next steps after an experiment must cover: the ship/no-ship decision, the post-launch monitoring plan, and the documentation step.
- Ship decision with context — "statistically significant 3.4% lift" — the conclusion restates the evidence so the next steps are self-contained.
- Monitoring plan — "monitor checkout completion rate and average order value daily for two weeks" — experiment conditions are controlled; production launch introduces new variables. Post-launch monitoring catches regression or interaction effects the experiment could not detect (e.g., a holiday spike, a concurrent backend change).
- Secondary metric — "average order value" — ensures the primary metric improvement does not come at the cost of another KPI (e.g., users complete checkout faster but spend less).
- Archival step — "archive experiment results in the A/B test log" — institutional memory. Future experiments that touch the same feature will benefit from knowing what was already tested.
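A monitoring plan like the one above can be reduced to a simple daily check. The sketch below flags any day on which a metric falls more than an agreed tolerance below its baseline. The function `flag_regressions`, the daily readings, the average order value baseline, and both tolerances are made up for illustration; only the 65.8% control completion rate comes from the example figures used earlier. Actual thresholds would need to be agreed with stakeholders before launch.

```python
def flag_regressions(daily_values, baseline, tolerance):
    """Return the days on which a metric fell more than `tolerance` below `baseline`.

    `daily_values` maps a day label to the observed metric value.
    Baseline and tolerance are assumptions, not prescribed values.
    """
    return [day for day, value in daily_values.items() if value < baseline - tolerance]

# Made-up post-launch readings for illustration only.
completion_rate = {"day 1": 0.679, "day 2": 0.683, "day 3": 0.641, "day 4": 0.680}
avg_order_value = {"day 1": 52.10, "day 2": 51.80, "day 3": 51.95, "day 4": 47.20}

alerts = {
    "checkout completion rate": flag_regressions(
        completion_rate, baseline=0.658, tolerance=0.01
    ),
    "average order value": flag_regressions(
        avg_order_value, baseline=52.00, tolerance=2.00
    ),
}
print(alerts)
```

Checking the secondary metric alongside the primary one is what lets the team notice a case where completion rate holds up but order value slips.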