Experimentation, Diagnostics, and Growth Infrastructure for Non-Technical PMs

What's being tested

Interviewers want to see that you can design, prioritize, and interpret online experiments to drive growth while managing risk and operational constraints. They'll probe your ability to pick the right unit of randomization, define primary and guardrail metrics, diagnose anomalous results, and decide whether to roll forward, roll back, or iterate. DoorDash cares because safe, fast, and repeatable experimentation unlocks high-velocity product improvements across marketplace, logistics, and promotions.

Core knowledge

A/B test lifecycle — hypothesis, metric spec, instrumentation QA, randomization, run with pre-specified stopping rules, analysis, and rollout/cleanup; each step owned by PM plus analytics partner.
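Unit of randomization (user, session, order, deliverer) — choose the unit so that one unit's treatment cannot leak into another unit's outcome; e.g., an experiment targeting checkout conversion should usually randomize at the user level, because order- or session-level assignment can show the same user different variants across orders and contaminate the comparison.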
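Minimum detectable effect (MDE) and power — estimate sample size with $N \propto \frac{(Z_{1-\alpha/2}+Z_{1-\beta})^2\sigma^2}{\Delta^2}$, where $\sigma^2$ is the metric's variance, $\Delta$ is the MDE, $\alpha$ is the significance level, and $1-\beta$ is the power; halving the MDE roughly quadruples the required sample, so set a realistic MDE given traffic and business impact.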
Sample Ratio Mismatch (SRM) — always compare expected vs. observed treatment allocation; an SRM points to bucketing, hashing, or rollout bugs that must be resolved before you trust any results.
Guardrail metrics — define safety KPIs (e.g., DAU, delivery time, refund rate) with automated alerts, so that a variant that wins on the primary metric cannot ship while harming the business elsewhere.
Attribution window & event latency — pick windows that reflect product dynamics (e.g., 7-day order conversion); account for late-arriving events and backfilled data before final decision.
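Multiple testing & sequential peeking — understand the family-wise error rate and corrections such as Bonferroni when running many experiments or comparing many metrics; separately, use pre-registered sequential testing or alpha spending so that repeated looks at a running experiment do not inflate false positives.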
Interference / SUTVA violations — anticipate cannibalization or network effects (e.g., pricing changes affecting marketplace supply) and design cluster or holdout experiments when interference is likely.
Instrumentation sources — know the common tools (e.g., Amplitude and Mixpanel for product analytics, Snowflake as the data warehouse, Looker for dashboards) and that the PM must validate event definitions, completeness, and user identity stitching before launch.
Feature flag governance — require LaunchDarkly/Optimizely-style toggles, naming standards, experiment registry, and enforced cleanup windows to prevent flag rot and cross-experiment leakage.
Experiment orchestration & metadata — an experiment catalogue with owners, hypothesis, MDE, sample size, start/end, and variants enables governance and prevents overlapping treatments that bias results.
Prioritization for infra investments — use frameworks like RICE to compare manual experiment cost vs. platform work (concurrency needs, cross-platform identity, automated analysis ROI).
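
To make the MDE and power item above concrete, here is a minimal sketch of a per-arm sample-size estimate for a conversion-rate metric. It assumes a two-sided test with two equal-sized arms (hence the factor of 2) and approximates the variance by the Bernoulli variance at the baseline rate. The function name, the 10% baseline, and the 1-point MDE are illustrative assumptions, not figures from any real experiment.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate units needed per arm to detect an absolute lift of mde_abs.

    Assumes a two-sided z-test on a difference in proportions with equal
    allocation; variance is approximated at the baseline rate.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{1-alpha/2}
    z_beta = NormalDist().inv_cdf(power)            # Z_{1-beta}
    variance = baseline_rate * (1 - baseline_rate)  # sigma^2 for a Bernoulli metric
    return ceil(2 * (z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical inputs: 10% baseline conversion, 1 percentage point MDE.
print(sample_size_per_arm(0.10, 0.01))
# Halving the MDE needs roughly four times the traffic.
print(sample_size_per_arm(0.10, 0.005))
```

With these assumed inputs the estimate comes out at roughly 14,000 users per arm, which is the kind of number to compare against weekly traffic before committing to an MDE.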

Worked example — diagnosing a negative treatment effect on conversion

Frame the problem: confirm the metric, target cohort, time range, and whether the drop is statistically significant and large enough to matter operationally. Clarifying questions: "What is the unit of randomization? Any SRM alerts? Which downstream metrics changed?" Skeleton of the diagnosis: (1) Validate the signal by checking SRM, instrumentation, and data completeness in Snowflake or your analytics tool; (2) Segment by device, geography, and new vs. returning users to localize the effect; (3) Inspect the funnel and timing, both upstream steps (exposure, click-through) and downstream ones (payment failures, refunds); (4) Rule out external causes such as releases, partner outages, holidays, or market events. Tradeoff to flag: speed vs. confidence. If the drop is large and breaches business-critical guardrails (refunds, p99 delivery time), roll back immediately; otherwise keep collecting data until the pre-specified sample size is reached. Close by proposing concrete next steps: pause rollout if guardrail breaches persist, create a deep-dive analytics ticket, and propose an experiment variant that isolates the suspected causal mechanism.
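
The first diagnostic step, the SRM check, can be sketched with a chi-square goodness-of-fit test on the observed arm counts. This is a minimal standard-library version for two arms (one degree of freedom). The function name, the example counts, and the 0.001 alert threshold are assumptions for illustration; teams often pick a strict threshold like this because SRM checks run on every experiment.

```python
from math import erfc, sqrt


def srm_p_value(control_n, treatment_n, expected_treatment_share=0.5):
    """P-value of a chi-square test that observed counts match the planned split."""
    total = control_n + treatment_n
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((treatment_n - expected_treatment) ** 2 / expected_treatment
            + (control_n - expected_control) ** 2 / expected_control)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


SRM_ALERT_THRESHOLD = 0.001  # assumed threshold, not a universal standard

# Hypothetical counts for a planned 50/50 split.
p = srm_p_value(control_n=50_000, treatment_n=51_500)
print("SRM suspected" if p < SRM_ALERT_THRESHOLD else "allocation looks consistent")
```

If the check fires, the conversion drop should be treated as untrustworthy until the allocation bug is found, regardless of how significant the metric movement looks.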

A second angle — scaling experimentation infrastructure for high concurrency

When asked to design infrastructure for many simultaneous experiments, reframe constraints: prevent cross-experiment contamination, automate power calculations and SRM detection, and provide experiment metadata and ownership. Key design pillars: centralized bucketing service with deterministic hashing; automated interference detection (flag correlated effects across experiments); a registry to prevent incompatible experiments on the same unit; and self-serve dashboards for pre/post-mortem analyses. Tradeoffs you’d call out: building a fully automated stack costs engineering time, but reduces false positives/negatives and speeds decisions; incremental investment (feature flags + registry + basic automation) often yields most near-term value.
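
As an illustration of the deterministic-hashing pillar, the sketch below assigns a unit to a variant by hashing the unit ID together with an experiment key. Salting with the experiment key means the same user lands in independent buckets across experiments, which helps limit cross-experiment contamination. The function name, key format, and weights are hypothetical; a production bucketing service would add exposure logging and registry checks.

```python
import hashlib


def assign_variant(unit_id, experiment_key,
                   variants=("control", "treatment"), weights=(0.5, 0.5)):
    """Deterministically map a unit to a variant; same inputs give the same output."""
    digest = hashlib.sha256(f"{experiment_key}:{unit_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a point in [0, 1).
    point = int(digest[:8], 16) / 2 ** 32
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if point < cumulative:
            return variant
    return variants[-1]  # guard against floating-point rounding in the weights


# Hypothetical IDs and experiment keys.
print(assign_variant("user_123", "checkout_button_v2"))
print(assign_variant("user_123", "promo_banner_v1"))
```

Because assignment is a pure function of the IDs, any service can recompute it without a lookup table, and an analyst can re-derive expected allocations when investigating an SRM.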

Common pitfalls

Pitfall: Stopping early because of a tempting p-value
Stopping at a "significant" early reading without pre-specified stopping rules invites false discoveries. Always follow pre-registered analysis plans or use sequential testing; communicate uncertainty when deviating.
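
A small simulation can show why peeking is risky. The sketch below runs A/A tests (no true difference between arms), checks a z-test at several interim looks, and compares how often "significance" appears at any look versus only at the final look. All parameters (number of trials, looks, sample sizes, seed) are arbitrary assumptions, and the metric is simulated as standard normal noise.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(trials=400, n_per_arm=500, looks=10, alpha=0.05, seed=7):
    """Return (rate when stopping at any significant look, rate at final look only)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = n_per_arm // looks
    peeking_hits = 0
    final_hits = 0
    for _trial in range(trials):
        sum_control = sum_treatment = 0.0
        rejected_at_any_look = False
        z = 0.0
        for look in range(1, looks + 1):
            for _obs in range(step):
                # Both arms draw from the same distribution: any "win" is a false positive.
                sum_control += rng.gauss(0, 1)
                sum_treatment += rng.gauss(0, 1)
            n = look * step
            z = (sum_treatment / n - sum_control / n) / sqrt(2 / n)
            if abs(z) > z_crit:
                rejected_at_any_look = True
        peeking_hits += rejected_at_any_look
        final_hits += abs(z) > z_crit
    return peeking_hits / trials, final_hits / trials


peeking_rate, final_rate = false_positive_rates()
print(f"any-look false positive rate: {peeking_rate:.2f}")
print(f"final-look false positive rate: {final_rate:.2f}")
```

Under these assumptions the any-look rate typically lands well above the nominal 5%, while the final-look rate stays near it, which is the gap that sequential testing or alpha spending is meant to close.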

Pitfall: Ignoring SRM or instrumentation failures
A statistically significant lift is worthless if bucketing or event logging is broken. Verify allocation, event definitions, and identity stitching before interpreting or surfacing results.

Pitfall: Overclaiming causality from noisy segment analyses
Cherry-picking segments post-hoc (e.g., “it worked for Android only”) without correction or a pre-specified segmentation plan risks misleading stakeholders; present such findings as hypothesis-generating.
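
One simple way to keep post-hoc segment cuts honest is a Bonferroni correction: divide the significance level by the number of segments examined. The sketch below uses made-up segment names and p-values purely for illustration. Bonferroni is conservative, and other corrections exist, so treat this as one option and not the required method.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which segments stay significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)  # stricter bar for each of the m comparisons
    return {segment: p <= threshold for segment, p in p_values.items()}


# Hypothetical segment-level p-values from a post-hoc breakdown.
segment_p_values = {
    "android": 0.03,
    "ios": 0.40,
    "web": 0.20,
    "new_users": 0.004,
    "returning_users": 0.60,
}

for segment, significant in bonferroni_significant(segment_p_values).items():
    print(f"{segment}: {'significant' if significant else 'not significant'}")
```

In this made-up example, "android" looks significant at 0.05 on its own but does not clear the corrected threshold of 0.01, so it belongs in the hypothesis-generating bucket.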

Connections

Interviewers may pivot to personalization/ML (how experiments inform models), data engineering (instrumentation and event integrity), or privacy/compliance (how experimentation respects user consent and differential privacy constraints). Be ready to connect technical limits to product tradeoffs.