Diagnostics, A/B Testing, Estimation, and Growth Infrastructure Fundamentals

What's being tested

Interviewers are probing your ability to diagnose product health, design and interpret A/B tests, produce quick but defensible estimates, and reason about the instrumentation and processes that let growth experiments scale. For a DoorDash PM, this means making fast, data-driven launch/kill decisions that protect customer experience (orders, delivery reliability, earnings) while moving core metrics like DAU, conversion, and order frequency.

Core knowledge

Metric taxonomy: Know primary metric (north star), key drivers (e.g., visits → add-to-cart → checkout → order), and guardrail metrics (e.g., p99 latency, average driver earnings, refunds) you must protect during experiments.
Sanity checks / invariants: Validate event volumes, identity counts, and platform-level invariants (e.g., unique users non-increasing week-over-week) before any causal analysis to catch instrumentation regressions.
Experiment design basics: Randomization unit (user, session, marketplace), holdout allocation, one-primary-metric policy, and pre-specifying duration, sample size, and stopping rules to avoid p-hacking.
Sample size / power: Use the two-sample proportion/sample-mean formula: $n \approx \frac{(Z_{1-\alpha/2}+Z_{1-\beta})^2(\sigma_1^2+\sigma_2^2)}{\Delta^2}$ where $\Delta$ is minimum detectable effect; plan for seasonality and traffic leakage when estimating.
Sequential testing & multiple comparisons: Understand peeking risk, and mitigations like alpha-spending/group-sequential or controlling false discovery rate (FDR) when running many experiments or segments.
Heterogeneous treatment effects: Predefine segments (new vs returning, city size, device) and use stratified analysis. Avoid post-hoc slicing unless treated as hypothesis generation.
Diagnosis workflow: Quick triage (rollbacks/releases, infra), funnel decomposition, segmentation, retention vs acquisition impacts, qualitative signals (support tickets), and experiment logs — iterate in that order.
Attribution & cross-contamination: Consider spillover (one user in treatment affects another), multi-device users, and cross-experiment interference in shared populations; choose randomization/unit and guardrails accordingly.
Ramp & rollout strategy: Start with a small percentage (e.g., 1–5%) for safety, monitor guardrails, then geometric or linear ramp; use feature flag rollbacks and canary principles.
Estimation heuristics: Back-of-envelope market sizing using top-down (total addressable market × penetration) or bottom-up (stores × orders/day × avg spend), and sanity-check estimates with observed platform rates.
Experiment infra expectations: As a PM, require logs of assignment, exposures, and events; insist on an experiment_metadata table (id, start/end, randomization unit, hypothesis, primary metric).
Decision framework: Combine statistical significance, practical significance (business impact), and implementation cost; a small p-value with tiny absolute lift may not justify rollout.

Worked example — diagnosing a sudden 12% drop in `DAU`

First 30 seconds: clarify when the drop began (UTC/local), affected segments (all cities or one), recent deploys, and active experiments. Declare assumptions: metric definition unchanged and event pipelines healthy. Skeleton of the investigation: (1) run sanity checks on event volumes and user identity counts; (2) check releases/rollbacks and infra incidents; (3) segment by geography, cohort, client-version, and acquisition channel; (4) inspect funnels (search → view → add) to localize where drop occurs; (5) check active experiments and feature flags for correlated changes. Flag one tradeoff explicitly: speed vs depth — prioritize rollbacks or killing experiments if guardrail breaches are evident rather than full root-cause. Close by recommending immediate short-term mitigations (kill suspect experiments, rollback recent deploy), then parallel deep dive: cohort-level funnel and session-replay/qualitative checks; if time permitted, add a synthetic user test across platforms.

A second angle — designing an A/B test to increase average tip rate

Frame success: define primary metric (mean tip rate per order), guardrails (time-to-checkout, cancellations), and unit (order vs rider vs user). Estimate sample size accounting for skewed tip distribution (consider log-transform or median as robust metric) and stratify by city and order size. Decide on ramp plan and safety thresholds for guardrails; pre-register analysis plan including adjustments for outliers and imbalanced assignment. Key difference from diagnosis: here you predefine hypotheses, required power, and operational rollout, whereas diagnosis is exploratory and often requires urgent mitigation.

Common pitfalls

Pitfall: Misattributing instrumentation drops to product regressions. Analysts often treat missing events as real metric changes; always confirm event pipeline health and identity deduping before action.

Pitfall: Anchoring on statistical significance alone. A p<0.05 result with a 0.1% lift that costs engineering investment is not automatically a go; discuss business impact and implementation complexity.

Pitfall: Over-slicing and post-hoc segmentation. Splitting data into many segments without pre-registration inflates false positives; use exploratory analyses to generate future testable hypotheses, not decisions.

Connections

Interviewers often pivot to metric design and OKRs, segmentation and personalization, or the tradeoffs between experiment velocity and production reliability (experimentation platform requirements). Be ready to connect diagnostic findings to product roadmap prioritization.

What's being tested

Core knowledge

Worked example — diagnosing a sudden 12% drop in `DAU`

A second angle — designing an A/B test to increase average tip rate

Common pitfalls

Connections

Further reading

Related concepts

What's being tested

Core knowledge

Worked example — diagnosing a sudden 12% drop in DAU

A second angle — designing an A/B test to increase average tip rate

Common pitfalls

Connections

Further reading

Related concepts

Worked example — diagnosing a sudden 12% drop in `DAU`