A growth engineer explains A/B testing to a PM: "An A/B test splits users randomly into control (gets the existing experience) and treatment (gets the new feature). We measure whether the treatment caused a statistically significant change in our primary metric. The p-value tells us the probability that we'd see a result at least this extreme if the treatment had no effect. If p < 0.05, we say the result is statistically significant: a difference this large would show up less than 5% of the time by chance alone if the treatment did nothing. But p < 0.05 doesn't mean the effect is large or important. It just means the result is unlikely to be random noise." What is a p-value in an A/B test, and what does statistical significance mean?
p-value: P(data at least this extreme | null hypothesis). The probability of observing results at least as extreme as the measured ones, assuming the null hypothesis is true (i.e., no effect). p = 0.03 means: if the treatment truly had no effect, there would be only a 3% chance of seeing a difference this large or larger from random sampling variation alone. It is not the probability that the null hypothesis is true.
Statistical significance: p < threshold (typically 0.05 or 0.01). Not the same as practical significance. A tiny effect can be statistically significant with a large sample.
Common mistakes:
Peeking: checking results repeatedly and stopping the test as soon as p < 0.05 inflates the Type I error rate well above alpha. Use sequential testing or pre-commit to a sample size.
P-hacking: running many tests and reporting only the one with p < 0.05.
Novelty effect: users engage more with new features simply because they're new, and the lift usually fades after a few weeks. Run tests long enough to see past it.
Statistical vocabulary:
Null hypothesis: assumes no effect (treatment = control). The hypothesis you try to reject.
Alternative hypothesis: there is an effect (treatment ≠ control).
Type I error (false positive): concluding there is an effect when there isn't. Probability = alpha (significance threshold).
Type II error (false negative): missing a real effect. Probability = beta. Power = 1 - beta.
Confidence interval: a range of plausible values for the true effect. 95% CI: if we repeated the experiment many times, about 95% of the intervals computed this way would contain the true value.
In conversation: 'p < 0.05 is a weak threshold for shipping decisions. We require p < 0.01 and a minimum effect size that's business-meaningful before we ship.'
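To make the p-value definition concrete, here is a minimal sketch of a two-sided, two-proportion z-test using only the Python standard library. The function name `two_proportion_p_value` and the conversion counts are hypothetical, chosen for illustration. It assumes samples large enough for the normal approximation, and a real experimentation platform may use a different test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both variants share the same conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under H0 both groups share one rate, so pool the data to estimate it.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a |z| at least this large if H0 is true.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 2.0% vs 2.3% conversion with 50,000 users per variant.
p = two_proportion_p_value(conv_a=1000, n_a=50_000, conv_b=1150, n_b=50_000)
print(f"p-value: {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

A small p here only says that the observed gap would be rare under "no effect". Whether a 0.3 percentage point lift is worth shipping is a separate, practical question.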
2 / 5
An analytics engineer explains minimum detectable effect: "Before running an experiment, we calculate the required sample size. The key inputs: significance level (alpha = 0.05), statistical power (1 - beta = 0.80), and the minimum detectable effect (MDE) — the smallest improvement we care about detecting. If our baseline conversion rate is 2% and we want to detect a 10% relative improvement (from 2.0% to 2.2%), we need about 80,000 users per variant. If we want to detect a 5% improvement, we need 320,000 per variant. Smaller MDE = larger required sample." What is the minimum detectable effect (MDE) and why does it matter for experiment design?
MDE (Minimum Detectable Effect): the smallest true effect size that the experiment has sufficient statistical power to detect.
Sample size relationship: smaller MDE → larger required sample (more data is needed to distinguish small effects from noise). Required sample size scales roughly with 1 / MDE², so halving the MDE about quadruples the sample.
Typical MDE choices: 5-10% relative improvement for high-traffic features (many users, short experiment duration), 20-30% for low-traffic features (smaller sample, longer duration).
Experiment design vocabulary:
Statistical power (1 - beta): probability of detecting a true effect if one exists. Standard: 0.80 (80%). Higher power requires a larger sample size.
Significance level (alpha): the false positive rate threshold. Standard: 0.05 (5%). Lower alpha requires a larger sample size.
Sample size calculator inputs: baseline metric value, MDE, alpha, power, traffic split ratio.
Randomisation unit: the entity being randomised (user ID, session ID, device ID). User-level is most common because it ensures a user always sees the same variant. Session-level leads to switching (same user in control one session, treatment the next).
SUTVA (Stable Unit Treatment Value Assumption): the treatment applied to one unit doesn't affect other units. Violated in social networks (viral effects) and marketplaces (supply/demand effects).
Network effects in experiments: if user A in treatment can influence user B in control, your control group is contaminated. Use cluster randomisation (randomise by social graph cluster).
In conversation: 'We use an MDE of 5% relative for most experiments. If the true effect is less than 5%, we're probably not excited about shipping it anyway, and an experiment sized to detect it would run impractically long.'
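The sample size relationship can be sketched with the standard normal-approximation formula for comparing two proportions. The function name `sample_size_per_variant` is hypothetical, and the formula assumes a two-sided test with an even traffic split, so treat it as a rough estimate, not a replacement for your experimentation platform's calculator. With the scenario's inputs (2% baseline, alpha = 0.05, power = 0.80) it lands close to the quoted per-variant figures.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Scenario inputs: 2% baseline conversion, 10% and 5% relative MDE.
n_10 = sample_size_per_variant(baseline=0.02, relative_mde=0.10)
n_5 = sample_size_per_variant(baseline=0.02, relative_mde=0.05)
print(f"10% relative MDE: {n_10:,} users per variant")
print(f" 5% relative MDE: {n_5:,} users per variant")
print(f"ratio: {n_5 / n_10:.2f}x")
```

The absolute difference `p2 - p1` is squared in the denominator, which is why halving the MDE costs roughly four times the traffic.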
3 / 5
A growth PM explains the AARRR framework to a new team member: "AARRR — Pirate Metrics — maps the user lifecycle: Acquisition (how do users find you?), Activation (do they have a great first experience?), Retention (do they come back?), Revenue (do you monetise them?), Referral (do they tell others?). The activation stage is the most critical and most neglected. Activation is the moment a user first experiences the core value — the aha moment. For Dropbox, it's placing a file in a shared folder. For Slack, it's sending 2,000 messages. If users don't activate, everything else is irrelevant." What is the activation metric and why is it considered the most important funnel stage?
Activation: the stage where an acquired user first experiences the core value proposition of the product. Also called the aha moment.
Examples:
Dropbox: user uploads their first file and sees it sync.
Slack: team sends 2,000 messages (enough to experience threaded conversations, integrations, and searchability).
Twitter: user follows 30 accounts (enough to have a meaningful feed).
Why it's the most important: activation sits between acquisition and retention, so if users don't activate, you're filling a leaky bucket. High acquisition + low activation = expensive churn. Improving activation has the highest leverage because it affects every user who joins, and retention, revenue, and referral all depend on it.
AARRR vocabulary:
Acquisition: new users arrive (via SEO, paid, referral, social). CAC (Customer Acquisition Cost) = total acquisition spend / new customers.
Activation: first core value experience. Activation rate = activated users / new users.
Retention: users returning. D1/D7/D30 retention = % of users returning after 1/7/30 days. Churn rate = % of users who stop using the product in a given period.
Revenue: monetization. ARPU (Average Revenue Per User), LTV (Lifetime Value), MRR (Monthly Recurring Revenue).
Referral: users bringing other users. Viral coefficient (K): average number of new users each existing user brings in. K > 1 = viral growth.
Funnel analysis: measuring conversion rate at each step from acquisition to revenue.
Cohort analysis: tracking a group of users who share a join date over time. Reveals retention curves.
In conversation: 'A 10% improvement in activation rate is often worth more than a 30% improvement in acquisition. Activation is the foundation; everything else is built on it.'
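The ratio definitions in the vocabulary above are simple enough to compute directly. The sketch below uses a hypothetical `CohortStats` record with made-up numbers. The field names are illustrative, and how "activated" or "returned" is defined depends entirely on your own product's events.

```python
from dataclasses import dataclass


@dataclass
class CohortStats:
    """Hypothetical signup cohort; every field name here is illustrative."""
    new_users: int            # users who signed up in the cohort window
    activated_users: int      # users who reached the product's activation event
    returned_day_7: int       # users seen again 7 days after signup
    acquisition_spend: float  # total spend attributed to this cohort
    new_customers: int        # users in the cohort who became paying customers


def activation_rate(c: CohortStats) -> float:
    return c.activated_users / c.new_users


def d7_retention(c: CohortStats) -> float:
    return c.returned_day_7 / c.new_users


def cac(c: CohortStats) -> float:
    return c.acquisition_spend / c.new_customers


# Made-up numbers purely for illustration.
cohort = CohortStats(new_users=10_000, activated_users=3_500,
                     returned_day_7=1_800, acquisition_spend=50_000.0,
                     new_customers=400)
print(f"activation rate: {activation_rate(cohort):.1%}")
print(f"D7 retention:    {d7_retention(cohort):.1%}")
print(f"CAC:             {cac(cohort):.2f}")
```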
4 / 5
An engineer explains feature flags to a PM who wants to ship a big feature: "A feature flag is a runtime switch that enables or disables a feature without a code deploy. We deploy the code with the flag off, gradually enable it for 1% of users, watch metrics, then 10%, 50%, 100%. If something breaks, we toggle the flag off instantly — no rollback deployment needed. Feature flags enable decoupling deployment from release. We've deployed this code to production already; we're releasing it to users gradually." What is the difference between deploying and releasing in the context of feature flags?
Decoupling deployment from release: one of the most valuable practices in modern software delivery.
Deploy: push code to production servers.
Release: make a feature visible/accessible to users.
Without feature flags: these are the same event. With feature flags: deploy continuously (multiple times per day); release deliberately (when ready, to the right audience).
Feature flag use cases:
Gradual rollout: 1% → 10% → 50% → 100% with metrics checks at each step.
A/B testing: flag exposes the feature to the treatment group.
Kill switch: turn off a problematic feature in seconds.
Dark launch: send real traffic to the new code path without showing users the result, which validates performance and correctness in production.
Beta testing: expose to opted-in users only.
Feature flag tools: LaunchDarkly, Statsig, GrowthBook (open-source), Unleash, Split.io, AWS AppConfig.
Flag types:
Release flag: on/off for a feature (temporary, should be removed after full rollout).
Experiment flag: percentage-based for A/B tests.
Permission flag: feature available to specific users/plans (permanent, e.g. premium features).
Ops flag: kill switch for performance or safety (emergency).
Technical debt: flags that are never cleaned up leave stale branches in the code and add complexity. Each flag should have an owner and an expiry date.
In conversation: 'Feature flags are the safety net that lets us ship continuously. Without them, we'd batch features into quarterly releases to reduce deployment risk, which creates the big-bang problems we're trying to avoid.'
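One common way to implement a gradual rollout is to hash each user into a stable bucket and compare it with the current rollout percentage. The sketch below illustrates that idea with the standard library only. The names `bucket`, `is_enabled`, and the flag `new-checkout` are hypothetical, and this is not how any particular vendor's SDK works internally.

```python
import hashlib


def bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    # Including the flag name means different flags get independent bucketing.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(user_id: str, flag_name: str, rollout_percent: int) -> bool:
    """True if the user falls inside the current rollout percentage."""
    return bucket(user_id, flag_name) < rollout_percent


# Hypothetical user IDs; raising the percentage only ever adds users.
users = [f"user-{i}" for i in range(10_000)]
enabled_10 = {u for u in users if is_enabled(u, "new-checkout", 10)}
enabled_50 = {u for u in users if is_enabled(u, "new-checkout", 50)}
print(len(enabled_10), len(enabled_50))
print("10% group stays enabled at 50%:", enabled_10 <= enabled_50)
```

Because the bucket depends only on the user ID and flag name, a user keeps the same experience across sessions, and setting the percentage to 0 acts as the kill switch without a deploy.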
5 / 5
A growth lead explains growth loops to a new PM: "A growth loop is a self-reinforcing cycle: user action creates output that brings in new users or deepens engagement. Viral loop: User A signs up → invites User B → B signs up → invites User C. Content loop: User creates content → content is indexed by Google → search traffic finds content → user sees value → creates more content. Product-led growth: engineer uses the product → loves it → recommends to their team → team buys seats. Loops compound over time — they're much more powerful than linear acquisition channels." What is a growth loop and how does it differ from a traditional marketing funnel?
Traditional marketing funnel: Awareness → Interest → Desire → Action. Linear. Terminates at conversion. Each new customer costs roughly the same to acquire. Doesn't compound.
Growth loop: cyclical. Each customer can generate more customers (viral loop) or more acquisition surface (content loop, SEO loop, product-led loop). Compounds over time: more users → more content/referrals → more users.
Loop types:
Viral loop: users invite other users. Viral coefficient K = invitations per user × acceptance rate. K > 1 = exponential growth.
Content loop: user-generated content attracts organic search traffic. Example: Quora, Pinterest, Reddit.
Product-led growth (PLG): product value drives acquisition. Individual user discovers product → team adoption → enterprise contract. Example: Slack, Figma, Notion.
SEO loop: more users → more data → better product → better rankings → more users.
Data loop: more users → better model → better product → more users.
Metrics for loops:
Viral coefficient (K): average new users generated by each existing user.
Cycle time: how long one loop iteration takes. Shorter cycle time = faster compounding.
Loop efficiency: % of users who complete the loop action.
Growth vocabulary:
PLG (Product-Led Growth): using the product itself as the primary acquisition and retention mechanism.
Freemium: free tier drives acquisition; paid features drive conversion.
Network effect: product becomes more valuable as more users join.
In conversation: 'The difference between a $1B company and a $10B company is usually whether they've found a growth loop. Funnels build businesses; loops build category leaders.'
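The effect of the viral coefficient is easier to see in a toy simulation. The sketch below is a deliberately simplified model, not a forecast: it assumes only the newest cohort sends invitations, each user invites once, and there is no churn. The function name `simulate_viral_loop` and all inputs are hypothetical.

```python
def simulate_viral_loop(seed_users: int, invites_per_user: float,
                        acceptance_rate: float, cycles: int) -> list[float]:
    """Total users after each cycle when only the newest cohort sends invites."""
    k = invites_per_user * acceptance_rate  # viral coefficient K
    total = float(seed_users)
    newest = float(seed_users)
    totals = [total]
    for _ in range(cycles):
        newest = newest * k  # each user in the newest cohort brings in K more
        total += newest
        totals.append(total)
    return totals


# K = 0.5: every cycle adds fewer users, so growth levels off.
sub_viral = simulate_viral_loop(seed_users=1000, invites_per_user=5,
                                acceptance_rate=0.1, cycles=20)
# K = 1.2: every cycle adds more users than the one before.
viral = simulate_viral_loop(seed_users=1000, invites_per_user=6,
                            acceptance_rate=0.2, cycles=10)
print(f"K=0.5 after 20 cycles: {sub_viral[-1]:.0f} users")
print(f"K=1.2 after 10 cycles: {viral[-1]:.0f} users")
```

In this model a loop with K below 1 still amplifies paid or organic acquisition (the total approaches seed / (1 - K)), but it cannot sustain growth on its own. Cycle time then determines how quickly either curve plays out in calendar time.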