A/B Testing and Growth Infrastructure for Non-Technical PMs
Asked of: Product Manager

What's being tested
Interviewers want to see that you can design, prioritize, and interpret experiments that drive sustainable growth while protecting the business. That means framing a clear hypothesis, choosing the correct unit of analysis and success/guardrail metrics, sizing the experiment for a meaningful minimum detectable effect (MDE), and owning rollout/risk decisions (ramp, kill conditions, rollback). DoorDash cares because safe, fast learning at scale is how product teams trade off velocity versus potential harm to customers and merchants.
Core knowledge
- A/B testing basics: randomize units (user, session, device) into treatment and control, make assignment reproducible, and tie exposure logs to downstream metrics for causal attribution.
- Unit of analysis tradeoffs: randomize by user to measure per-user conversion; by order to measure per-order metrics; choose the unit that avoids cross-contamination and aligns with the primary business metric.
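- Minimum Detectable Effect (MDE) and sample sizing: approximate the per-arm sample size for proportions using $n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\, p(1-p)}{\delta^2}$, where $p$ is the baseline rate and $\delta$ is the absolute lift. Use 80–90% power ($1-\beta$) and a typical $\alpha = 0.05$.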
- Statistical risks: Type I error (false positive) and Type II error (false negative). Avoid peeking unless you apply a sequential correction (alpha spending, e.g., O’Brien–Fleming boundaries) or follow pre-registered stopping rules.
- Multiple comparisons: apply Bonferroni or Benjamini–Hochberg FDR control when testing multiple variants/metrics; prefer FDR for many correlated metrics.
- Guardrail metrics: always define at least one negative-impact metric (e.g., `cancellation_rate`, `p99_latency`, `DAU`) and stop thresholds tied to business impact rather than only statistical significance.
- Exposure & instrumentation: log deterministic assignment + exposure timestamp + decision metadata to Postgres/event stream (Kafka) so analysts can join treatment to outcomes; verify via an exposure QA funnel before ramping.
- Identity stitching and contamination: if users switch devices, ensure consistent bucketing via deterministic hashing on stable `user_id`; if impossible, prefer session/order-level randomization and call out bias.
- Sequential ramping and rollout: use conservative initial rollouts (e.g., 1% → 10% → 50%) with automated health checks and manual go/no-go gates.
  Tip: automate kill-switches for revenue and safety guardrails.
- Business prioritization: score experiments by expected value = baseline * expected lift * revenue per event * probability of success; use ICE/RICE to balance effort and potential.
- Analysis perspective: report absolute and relative lifts, confidence intervals, and “value impact” (e.g., incremental weekly revenue), and check heterogeneity by key segments (new vs returning users, city-size).
- When experiments are infeasible: use quasi-experimental approaches (difference-in-differences, regression discontinuity), but treat findings as lower confidence and enumerate the assumptions they rest on.
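The sample-sizing bullet above can be made concrete with a short calculation. This is a minimal sketch, assuming the standard normal-approximation formula for comparing two proportions, n ≈ 2(z_{1-α/2} + z_{1-β})² p(1-p) / δ², with the baseline variance used for both arms. The function name `per_arm_sample_size` and the example numbers are illustrative, not taken from any particular experimentation platform.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(baseline: float, absolute_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate units needed per arm for a two-sided test on proportions.

    baseline       -- control conversion rate p, strictly between 0 and 1
    absolute_lift  -- minimum detectable effect delta, in absolute terms
    """
    if not 0 < baseline < 1:
        raise ValueError("baseline must be strictly between 0 and 1")
    if absolute_lift == 0:
        raise ValueError("absolute_lift must be non-zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = baseline * (1 - baseline)           # Bernoulli variance at baseline

    n = 2 * (z_alpha + z_beta) ** 2 * variance / absolute_lift ** 2
    return ceil(n)


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline conversion, 1 ppt absolute lift.
    print(per_arm_sample_size(0.10, 0.01))
```

With a hypothetical 10% baseline and a 1 ppt absolute lift, this comes out to roughly 14,100 units per arm at 80% power. Halving the lift roughly quadruples the requirement, which is why the choice of MDE dominates how long a test has to run.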
Worked example — “Design an A/B test to increase checkout conversion rate”
First 30s framing questions: clarify the precise metric (e.g., completed order / initiated checkout), unit of randomization (user vs checkout), target population (all users, new users), and acceptable MDE and risk tolerances. Outline assumptions about expected baseline conversion and seasonality.
Skeleton of an answer:
- Hypothesis and metric: “Reducing friction on the payment screen will increase completed checkouts; primary metric = checkout conversion rate; guardrails = `average_order_value` and `cancellation_rate`.”
- Randomization plan: bucket by stable `user_id` to avoid cross-contamination; deterministic hashing ensures consistent assignment across sessions.
- Sizing & ramp: choose an MDE that matters (e.g., 2% relative lift), calculate sample size with 80% power, and start with a 1% rollout for QA → 10% → full once exposure and analytics checks pass.
- Monitoring & kill rules: predefine thresholds for guardrails (e.g., if `cancellation_rate` rises by 0.5 ppt and the change is statistically significant, pause); use automated alerts tied to telemetry.
Tradeoff to flag: user-level randomization avoids contamination but yields fewer independent units than per-checkout randomization, so reaching the required sample takes longer; explain this sample/time tradeoff. Close with: “If I had more time, I’d pre-register the analysis, run power simulations with variance from historical data, and plan segment analyses for new vs returning users.”
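The randomization plan in the skeleton relies on deterministic hashing of a stable `user_id`. A minimal sketch of what that could look like follows; the function name `assign_bucket`, the experiment key `checkout_v2`, and the choice of SHA-256 are assumptions for illustration, not a description of any specific platform.

```python
import hashlib


def assign_bucket(user_id: str, experiment: str,
                  treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    The experiment name is mixed into the hash so that the same user
    can land in different arms across unrelated experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 8 hex characters give an integer in [0, 2**32); scale to [0, 1).
    position = int(digest[:8], 16) / 2**32
    return "treatment" if position < treatment_share else "control"


if __name__ == "__main__":
    # Same inputs always give the same arm, on any device or session.
    print(assign_bucket("user-123", "checkout_v2"))
    # During a ramp, only the share changes; the user's hash position does not.
    print(assign_bucket("user-123", "checkout_v2", treatment_share=0.01))
```

Because each user's position in [0, 1) is fixed, raising `treatment_share` from 1% to 10% to 50% only adds users to treatment; nobody who was already exposed gets moved back to control mid-ramp.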
A second angle — “Experiment to reduce delivery cancellations”
Framing shifts: the primary metric is `cancellation_rate` (a rare event), and the unit should be the order or the driver, not the user. Power calculations must account for the low baseline, so you need either much larger samples or a stronger, higher-cost intervention targeted at high-risk orders. Network effects are also stronger: changing routing or incentives may shift driver behavior system-wide, causing interference between units. Prefer a targeted pilot in a few zones (cluster randomization by city/zone) and use cluster-adjusted sample-size formulas. Also emphasize operational guardrails (driver earnings, SLA) and rollback plans, because driver satisfaction affects marketplace health.
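One common cluster adjustment is the design effect, DEFF = 1 + (m - 1)ρ, where m is the average cluster size and ρ is the intra-cluster correlation. The sketch below assumes roughly equal-sized zones and an ICC estimated from historical data; the function names and the example numbers are hypothetical.

```python
from math import ceil


def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Variance inflation from randomizing clusters instead of individuals."""
    return 1 + (avg_cluster_size - 1) * icc


def clusters_needed_per_arm(n_individual: int, avg_cluster_size: float,
                            icc: float) -> int:
    """Clusters (e.g., zones) per arm, given an individually randomized size.

    n_individual is the per-arm sample size you would need if orders
    could be randomized independently.
    """
    inflated = n_individual * design_effect(avg_cluster_size, icc)
    return ceil(inflated / avg_cluster_size)


if __name__ == "__main__":
    # Hypothetical: 50,000 orders per arm if randomized individually,
    # about 2,000 orders per zone during the test window, ICC of 0.01.
    print(design_effect(2000, 0.01))
    print(clusters_needed_per_arm(50_000, 2000, 0.01))
```

Even a small ICC inflates the requirement sharply when clusters are large, which is the quantitative reason a zone-level pilot usually needs many zones or a long run time.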
Common pitfalls
Pitfall: Designing experiments without predefining the primary metric and guardrails. Without a pre-registered primary metric you’ll chase significance on noisy secondary metrics and make bad product decisions.
Pitfall: Randomizing at the wrong unit. Randomizing orders when the feature impacts user behavior (or vice versa) causes contamination and biased estimates; explicitly justify unit choice and show how you'll measure cross-unit interference.
Pitfall: Peeking and ramping without corrections. Stopping as soon as a p-value drops below 0.05 inflates false positives; instead use pre-specified sequential rules or conservative alpha spending and report confidence intervals and practical significance.
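The peeking pitfall is easy to demonstrate with a simulation of A/A tests, where both arms share the same true conversion rate and every "significant" result is a false positive. This is a rough sketch under assumed parameters (10 looks, 100 users per arm per look, a 10% rate, a pooled two-proportion z-test); the function names are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_test_p_value(conv_a: int, conv_b: int, n: int) -> float:
    """Two-sided p-value for a difference in proportions, n users per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = (conv_b / n - conv_a / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests: int = 1000, looks: int = 10,
                      users_per_look: int = 100, rate: float = 0.10,
                      alpha: float = 0.05, seed: int = 7):
    """Return (false-positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    any_look_significant = 0
    final_look_significant = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        crossed = False
        p = 1.0
        for _ in range(looks):
            # Both arms draw from the same rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = z_test_p_value(conv_a, conv_b, n)
            if p < alpha:
                crossed = True  # a peeker would have stopped here
        any_look_significant += crossed
        final_look_significant += p < alpha
    return any_look_significant / n_tests, final_look_significant / n_tests


if __name__ == "__main__":
    peeking, fixed = simulate_aa_tests()
    print(f"peeking: {peeking:.3f}  single final look: {fixed:.3f}")
```

The single-look rate should sit near the nominal 5%, while the stop-at-first-significance rate typically lands well above it. That gap is what sequential corrections such as alpha spending are designed to close.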
Connections
Interviewers may pivot to experimentation platforms (tradeoffs between building in-house vs Optimizely/LaunchDarkly), metric-definition infrastructure (consistent DAU/MAU definitions in Looker), or to causal inference topics (difference-in-differences, treatment effect heterogeneity) if they want to push on analysis depth.
Further reading
- “Practical Guide to Controlled Experiments on the Web” (Kohavi et al.): foundational paper on experimentation best practices and pitfalls.
- “Trustworthy Online Controlled Experiments” (Kohavi, Tang, and Xu): deep dive into platform-level considerations and sequential testing.
Related concepts
- Experimentation, Diagnostics, and Growth Infrastructure for Non-Technical PMs
- Diagnostics, A/B Testing, Estimation, and Growth Infrastructure Fundamentals
- PM Technical Fundamentals for Growth Experimentation
- Growth Diagnostics, Metric Trees, Estimation, and A/B Testing
- Technical Fundamentals for Non-Technical Product Managers