A/B Testing and Growth Infrastructure for Non-Technical PMs

What's being tested

Interviewers want to see that you can design, prioritize, and interpret experiments that drive sustainable growth while protecting the business. That means framing a clear hypothesis, choosing the correct unit of analysis and success/guardrail metrics, sizing the experiment for a meaningful minimum detectable effect (MDE), and owning rollout/risk decisions (ramp, kill conditions, rollback). DoorDash cares because safe, fast learning at scale is how product teams trade off velocity versus potential harm to customers and merchants.

Core knowledge

A/B testing basics: randomize units (user, session, device) to treatment/control, make the assignment reproducible, and tie exposure logs to downstream metrics for causal attribution.
Unit of analysis tradeoffs: randomize by user to measure per-user conversion; by order to measure per-order metrics; choose the unit that avoids cross-contamination and aligns with the primary business metric.
Minimum Detectable Effect (MDE) and sample sizing: approximate per-arm sample size for proportions using $n\approx\frac{(Z_{1-\alpha/2}+Z_{1-\beta})^2\cdot 2\bar p(1-\bar p)}{\Delta^2}$, where $\bar p$ is the baseline rate and $\Delta$ is the absolute lift. Use 80–90% power and the typical $\alpha=0.05$.
Statistical risks: Type I error (false positive) and Type II error (false negative). Do not peek at interim results unless you apply a sequential correction (alpha spending, e.g., O’Brien–Fleming boundaries) or follow pre-registered stopping rules.
Multiple comparisons: apply Bonferroni or Benjamini–Hochberg FDR control when testing multiple variants/metrics; prefer FDR for many correlated metrics.
Guardrail metrics: always define at least one negative-impact metric (e.g., cancellation_rate, p99_latency, DAU) and set stop thresholds tied to business impact rather than only statistical significance.
Exposure & instrumentation: log deterministic assignment + exposure timestamp + decision metadata to Postgres/event stream (Kafka) so analysts can join treatment to outcomes; verify via an exposure QA funnel before ramping.
Identity stitching and contamination: if users switch devices, ensure consistent bucketing via deterministic hashing on stable user_id; if impossible, prefer session/order-level randomization and call out bias.
Sequential ramping and rollout: use conservative initial rollouts (e.g., 1% → 10% → 50%) with automated health checks and manual go/no-go gates.

Tip: automate kill-switches for revenue and safety guardrails.
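
The ramp-and-kill-switch idea above can be sketched as follows. This is a minimal illustration, not a production rollout controller: the function names, the metric names, and the latency limit are hypothetical, and the 0.5ppt cancellation limit is borrowed from the worked example later in this guide. It automates only the health check; the manual go/no-go gate would sit on top of it.

```python
# Ramp stages from the text: 1% -> 10% -> 50%.
RAMP_STAGES = [0.01, 0.10, 0.50]

def breached_guardrails(observed_deltas: dict, max_allowed: dict) -> list:
    """Return the guardrails whose treatment-minus-control delta exceeds its limit.

    A guardrail with no observed value counts as breached, so missing
    telemetry fails safe instead of silently passing.
    """
    return [
        name
        for name, limit in max_allowed.items()
        if observed_deltas.get(name, float("inf")) > limit
    ]

def next_exposure(current: float, observed_deltas: dict, max_allowed: dict) -> float:
    """Kill switch: drop to 0% on any breach, otherwise advance one stage."""
    if breached_guardrails(observed_deltas, max_allowed):
        return 0.0
    later_stages = [stage for stage in RAMP_STAGES if stage > current]
    return later_stages[0] if later_stages else current

# Hypothetical limits: +0.5ppt cancellations, +50 ms p99 latency.
LIMITS = {"cancellation_rate": 0.005, "p99_latency_ms": 50.0}

healthy = {"cancellation_rate": 0.001, "p99_latency_ms": 12.0}
unhealthy = {"cancellation_rate": 0.006, "p99_latency_ms": 12.0}

print(next_exposure(0.01, healthy, LIMITS))    # advances to the next stage
print(next_exposure(0.10, unhealthy, LIMITS))  # rolls back to 0%
```

Treating missing metrics as a breach is a deliberate design choice here: a broken dashboard should pause a ramp rather than let it proceed.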

Business prioritization: score experiments by expected value = baseline event volume × expected relative lift × revenue per event × probability of success; use ICE/RICE to balance effort and potential.
Analysis perspective: report absolute and relative lifts, confidence intervals, and “value impact” (e.g., incremental weekly revenue), and check heterogeneity by key segments (new vs returning users, city-size).
When experiments are infeasible: use quasi-experimental approaches (difference-in-differences, regression discontinuity), but treat the findings as lower confidence and enumerate the assumptions they rest on.
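
The expected-value score from the prioritization item above is simple enough to compute in a few lines. The sketch below assumes "baseline" means weekly event volume and "expected lift" is a relative lift; the class, the idea names, and every number are hypothetical and only there to show the ranking.

```python
from dataclasses import dataclass

@dataclass
class ExperimentIdea:
    name: str
    baseline_events_per_week: float  # e.g. checkouts per week in the target population
    expected_relative_lift: float    # 0.02 means +2% relative
    revenue_per_event: float         # revenue attributed to one extra event
    p_success: float                 # subjective probability the idea works

    def expected_value(self) -> float:
        """Expected incremental weekly revenue, per the formula in the text."""
        return (
            self.baseline_events_per_week
            * self.expected_relative_lift
            * self.revenue_per_event
            * self.p_success
        )

# Hypothetical backlog: a small lift on a large surface vs a big lift on a small one.
ideas = [
    ExperimentIdea("simplify_payment_screen", 100_000, 0.02, 5.0, 0.3),
    ExperimentIdea("one_tap_reorder", 20_000, 0.10, 5.0, 0.5),
]

ranked = sorted(ideas, key=lambda idea: idea.expected_value(), reverse=True)
for idea in ranked:
    print(f"{idea.name}: {idea.expected_value():,.0f} per week")
```

The score ignores effort, which is why the text pairs it with ICE/RICE rather than using it alone.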

Worked example — “Design an A/B test to increase checkout conversion rate”

First 30s framing questions: clarify the precise metric (e.g., completed order / initiated checkout), unit of randomization (user vs checkout), target population (all users, new users), and acceptable MDE and risk tolerances. Outline assumptions about expected baseline conversion and seasonality.

Skeleton of an answer:

Hypothesis and metric: “Reducing friction on the payment screen will increase completed checkouts; primary metric = checkout conversion rate; guardrail = average_order_value and cancellation_rate.”
Randomization plan: bucket by stable user_id to avoid cross-contamination; deterministic hashing ensures consistent assignment across sessions.
Sizing & ramp: pick an MDE that matters to the business (e.g., 2% relative lift), calculate the sample size for 80% power, and start with a 1% rollout for QA → 10% → full once exposure and analytics checks pass.
Monitoring & kill rules: predefine thresholds for guardrails (e.g., if cancellation_rate ↑ by 0.5ppt and statistically significant, pause); use automated alerts tied to telemetry.
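
The randomization plan in the skeleton above relies on deterministic hashing of a stable user_id. A minimal sketch of that idea, assuming SHA-256 as the hash and a hypothetical experiment name used as a salt (neither is specified in this guide):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user into 'treatment' or 'control'.

    Salting with the experiment name keeps assignments independent across
    experiments; the same (user_id, experiment) pair always hashes to the
    same bucket, on any device or session.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to one of 10,000 equal-width buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "treatment" if bucket < round(treatment_share * 10_000) else "control"

# Same user, same experiment: the answer never changes between calls.
first = assign_variant("user_12345", "checkout_friction_v1")
second = assign_variant("user_12345", "checkout_friction_v1")
print(first, second)
```

Because assignment is a pure function of the inputs, analysts can recompute it later to audit the exposure logs.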

Tradeoff to flag: user-level randomization yields fewer independent units than per-checkout randomization, so the test takes longer to reach the required sample, but it avoids contamination; explain this sample/time tradeoff. Close with: “With more time, I’d pre-register the analysis, run power simulations with variance from historical data, and plan segment analyses for new vs returning users.”
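
To make the sample/time tradeoff concrete, the per-arm formula from Core knowledge can be evaluated directly. The sketch below follows that formula as written, using the baseline rate for $\bar p$; the 10% baseline conversion is a hypothetical number, not a real figure.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(
    baseline: float, abs_lift: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate per-arm sample size for a two-sided test on two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{1-alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # Z_{1-beta}
    # p-bar is taken as the baseline rate, as in the formula in the text.
    variance_term = 2 * baseline * (1 - baseline)
    return ceil((z_alpha + z_beta) ** 2 * variance_term / abs_lift ** 2)

# Hypothetical: 10% baseline conversion, 2% relative lift = 0.2ppt absolute.
baseline = 0.10
abs_lift = baseline * 0.02
n = sample_size_per_arm(baseline, abs_lift)
print(n)
```

With these assumed inputs the result is on the order of 350,000 users per arm, which is why a 2% relative MDE on a 10% baseline usually implies weeks of traffic rather than days. Halving the MDE roughly quadruples the requirement, since n scales with $1/\Delta^2$.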

A second angle — “Experiment to reduce delivery cancellations”

Framing shifts: the primary metric is cancellation_rate (a rare event), and the unit should be order or driver, not user. Power calculations must account for the low baseline: detecting a change in a rare event requires much larger samples, or a higher-cost intervention targeted at high-risk orders. Network/peer effects are also stronger: changing routing or incentives may affect driver behavior system-wide, causing interference between units. You’d prefer a targeted pilot in a few zones (cluster randomization by city/zone) and use cluster-adjusted sample-size formulas. Also emphasize operational guardrails (driver earnings, SLA) and rollback plans, because driver satisfaction impacts marketplace health.
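
One common form of the cluster adjustment mentioned above is the design effect $1 + (m - 1)\rho$, where $m$ is the average cluster size and $\rho$ is the intraclass correlation (ICC). The sketch below assumes equal-sized zones; the ICC, the zone size, and the individual-level sample size are hypothetical inputs that would come from historical data.

```python
from math import ceil

def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Variance inflation from randomizing clusters instead of individuals."""
    return 1 + (avg_cluster_size - 1) * icc

def clusters_per_arm(n_individual: int, avg_cluster_size: float, icc: float) -> int:
    """Number of clusters (e.g. zones) needed per arm.

    n_individual is the per-arm sample size an order-level test would need;
    it is inflated by the design effect, then divided by the cluster size.
    """
    n_adjusted = n_individual * design_effect(avg_cluster_size, icc)
    return ceil(n_adjusted / avg_cluster_size)

# Hypothetical: 50,000 orders per arm if randomized by order,
# zones of about 2,000 orders each, ICC of 0.01.
print(design_effect(2000, 0.01))
print(clusters_per_arm(50_000, 2000, 0.01))
```

Even a small ICC matters when clusters are large: with these assumed numbers the design effect is about 21, so a zone-level pilot needs far more orders than an order-level test would.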

Common pitfalls

Pitfall: Designing experiments without predefining the primary metric and guardrails. Without a pre-registered primary metric you’ll chase significance on noisy secondary metrics and make bad product decisions.
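
When secondary metrics are reported alongside the primary one, the Benjamini–Hochberg control mentioned under Core knowledge is one way to avoid chasing noise. A minimal sketch of the standard step-up procedure follows; the metric names and p-values are made up for illustration.

```python
def benjamini_hochberg(p_values: dict, q: float = 0.05) -> list:
    """Return the metrics declared significant at false discovery rate q.

    Step-up rule: sort p-values ascending, find the largest rank k with
    p_(k) <= (k / m) * q, and reject the k smallest.
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * q:
            cutoff = rank  # keep the largest rank that passes
    return [name for name, _ in ranked[:cutoff]]

# Hypothetical secondary-metric p-values from one experiment.
secondary = {
    "average_order_value": 0.003,
    "tip_rate": 0.02,
    "cancellation_rate": 0.04,
    "support_contacts": 0.30,
}

print(benjamini_hochberg(secondary))
```

In this made-up example a naive 0.05 cutoff would flag three metrics, while the procedure keeps only the two with the smallest p-values.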

Pitfall: Randomizing at the wrong unit. Randomizing orders when the feature impacts user behavior (or vice versa) causes contamination and biased estimates; explicitly justify unit choice and show how you'll measure cross-unit interference.

Pitfall: Peeking and ramping without corrections. Stopping as soon as a p-value drops below 0.05 inflates false positives; instead use pre-specified sequential rules or conservative alpha spending and report confidence intervals and practical significance.
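
The inflation from peeking is easy to demonstrate with a small A/A simulation, where both arms are identical and any "significant" result is a false positive. The sketch below is a simplified illustration: it uses normally distributed outcomes with known unit variance and a z-test instead of conversion data, and the number of looks, batch size, and seed are arbitrary choices.

```python
import random
from statistics import NormalDist

def false_positive_rates(n_sims=1000, looks=10, n_per_look=50, alpha=0.05, seed=7):
    """Compare 'stop at the first significant peek' with a single final test.

    Returns (peeking_rate, fixed_rate) for an A/A test with no true effect.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        crossed_at_any_look = False
        for _ in range(looks):
            for _ in range(n_per_look):
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
            n += n_per_look
            # z-statistic for the difference in means with unit variance per arm.
            z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
            if abs(z) > z_crit:
                crossed_at_any_look = True
        peeking_hits += int(crossed_at_any_look)
        fixed_hits += int(abs(z) > z_crit)  # z from the final look only
    return peeking_hits / n_sims, fixed_hits / n_sims

if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"peeking at every look: {peeking:.3f}, single final test: {fixed:.3f}")
```

The single final test should land near the nominal 5%, while stopping at the first significant peek across ten looks produces a false-positive rate several times higher.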

Connections

Interviewers may pivot to experimentation platforms (tradeoffs between building in-house vs Optimizely/LaunchDarkly), metric-definition infrastructure (consistent DAU/MAU definitions in Looker), or to causal inference topics (difference-in-differences, treatment effect heterogeneity) if they want to push on analysis depth.