Statistical Power Analysis

What's being tested

Statistical power analysis in these interviews is about whether you can design an experiment that has a realistic chance of detecting a business-relevant effect before DoorDash spends weeks running it. The interviewer is probing whether you understand the link between effect size, variance, sample size, randomization unit, and experiment duration, especially in marketplace settings with interference between consumers, Dashers, and merchants. You are expected to reason beyond a simple user-level A/B test: switchbacks, geo launches, notification campaigns, courier initiatives, and revenue algorithms all create clustering, contamination, seasonality, and heterogeneous treatment effects. DoorDash cares because underpowered tests waste marketplace capacity, while overpowered or poorly framed tests can ship statistically significant but operationally meaningless changes.

Core knowledge

Power is the probability of detecting a true effect of a specified size, $1 - \beta$, commonly set to 80% or 90%. The standard test design also chooses a Type I error rate $\alpha$, often 5%, and a practically meaningful minimum detectable effect (MDE).
MDE should come from product or business relevance, not from whatever the experiment can conveniently detect. For DoorDash, a 0.1% lift in orders_per_consumer may matter at scale, while a 0.1% lift in Dasher_acceptance_rate may be meaningless if it worsens on_time_delivery_rate.
For a two-sample difference in means with equal allocation, approximate per-arm sample size is:
$n \approx \frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{\delta^2}$
where $\sigma^2$ is the outcome variance (assumed equal in both arms), $\delta$ is the absolute MDE, and the test is two-sided. For binary metrics, $\sigma^2=p(1-p)$, with $p$ the baseline rate.
Relative MDE and absolute MDE must be translated carefully. If baseline conversion_rate is 20%, a 2% relative lift means $\delta=0.004$ absolute, not 0.02. Many weak interview answers overstate power by confusing percentage points with percent lift.
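The formula and the relative-versus-absolute distinction can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `per_arm_sample_size` is hypothetical, and it assumes a two-sided test, equal allocation, and the normal approximation used above.

```python
import math
from statistics import NormalDist


def per_arm_sample_size(sigma2, delta, alpha=0.05, power=0.80):
    """Approximate per-arm n for a two-sided, two-sample difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * sigma2 * (z_alpha + z_beta) ** 2 / delta ** 2)


baseline = 0.20                      # baseline conversion_rate
relative_lift = 0.02                 # 2% relative lift
delta_abs = baseline * relative_lift # 0.004 absolute, i.e. 0.4 percentage points
sigma2 = baseline * (1 - baseline)   # binary-metric variance p(1 - p)

n_correct = per_arm_sample_size(sigma2, delta_abs)
# Mistaking the 2% relative lift for 2 percentage points:
n_wrong = per_arm_sample_size(sigma2, 0.02)

print(n_correct, n_wrong, n_correct / n_wrong)
```

Because $n$ scales with $1/\delta^2$, using 0.02 instead of 0.004 understates the required sample by a factor of about 25, which is how the confusion leads to overstated power.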
Randomization unit determines effective sample size. User-level randomization works for independent consumer experiences, but marketplace experiments often need cluster randomization by geo, store, Dasher, or time block to reduce interference. More rows in the dataset do not imply more independent observations.
Clustered designs inflate variance through the intra-cluster correlation. A common design effect is:
$DE = 1 + (m-1)\rho$
where $m$ is average cluster size and $\rho$ is the intra-cluster correlation. Effective sample size is roughly $n_\text{raw}/DE$, which can drastically reduce power for geo or store-level tests.
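To make the size of this penalty concrete, here is a minimal sketch of the design-effect arithmetic. The numbers (one million consumers, clusters of 5,000, $\rho = 0.01$) are made up for illustration, and the helper names are hypothetical.

```python
def design_effect(avg_cluster_size, icc):
    """DE = 1 + (m - 1) * rho for clusters of average size m."""
    return 1 + (avg_cluster_size - 1) * icc


def effective_sample_size(n_raw, avg_cluster_size, icc):
    """Approximate number of independent observations after clustering."""
    return n_raw / design_effect(avg_cluster_size, icc)


n_raw = 1_000_000   # illustrative: consumers across all clusters
m = 5_000           # illustrative: average consumers per cluster
rho = 0.01          # illustrative: intra-cluster correlation

de = design_effect(m, rho)
n_eff = effective_sample_size(n_raw, m, rho)
print(f"design effect = {de:.2f}, effective n = {n_eff:,.0f}")
```

Even a small intra-cluster correlation shrinks a million raw observations to roughly twenty thousand effective ones under these assumptions, because the penalty grows with cluster size.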
Switchback experiments randomize treatment by market-time blocks, such as city-hour or zone-day. Power depends on the number of randomized blocks, not just total orders. Analysis should usually include fixed effects for market and time, plus clustered standard errors at the randomization level.
Pre-period covariates can improve power through CUPED or regression adjustment. If a covariate $X$ explains outcome variance, adjusted variance becomes approximately $\sigma_Y^2(1-\rho_{YX}^2)$, where $\rho_{YX}$ is the correlation between the outcome $Y$ and the covariate. This is valuable for noisy metrics like gross_order_value, delivery_time, and retention.
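The variance reduction can be seen on simulated data. The sketch below is a simplified CUPED-style adjustment, not a production implementation: the data-generating process is invented, `x` stands in for a pre-period value of the metric, and `theta` is the usual regression coefficient $\mathrm{cov}(X, Y)/\mathrm{var}(X)$.

```python
import random
from statistics import fmean

random.seed(0)
n = 5000
# Invented data: x is a pre-period covariate, y the in-experiment outcome.
x = [random.gauss(30.0, 10.0) for _ in range(n)]
y = [0.8 * xi + random.gauss(0.0, 6.0) for xi in x]


def sample_cov(a, b):
    """Unbiased sample covariance; sample_cov(a, a) is the variance."""
    mu_a, mu_b = fmean(a), fmean(b)
    return sum((ai - mu_a) * (bi - mu_b) for ai, bi in zip(a, b)) / (len(a) - 1)


theta = sample_cov(x, y) / sample_cov(x, x)
x_bar = fmean(x)
# Subtract the part of y that is predictable from the pre-period covariate.
y_adj = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

rho_sq = sample_cov(x, y) ** 2 / (sample_cov(x, x) * sample_cov(y, y))
ratio = sample_cov(y_adj, y_adj) / sample_cov(y, y)
print(f"rho^2 = {rho_sq:.3f}, variance ratio = {ratio:.3f}")
```

The printed variance ratio equals $1-\rho_{YX}^2$ in-sample, and since required sample size is proportional to variance, the same factor applies to $n$.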
Triggering affects power and interpretation. If only 10% of consumers receive an extra notification, analyzing all assigned users estimates the intent-to-treat effect and may be diluted. A triggered analysis can improve sensitivity but risks post-treatment selection bias unless eligibility is defined pre-randomization.
Ratio metrics such as revenue_per_order, orders_per_active_user, or cost_per_delivery need careful variance estimation. Delta method, bootstrap, or user-level aggregation is often safer than treating numerator and denominator events as independent.
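As one illustration of the delta-method option, the sketch below estimates a revenue_per_order style ratio from user-level totals and computes its standard error by linearization. The data are simulated and the function name `ratio_with_delta_se` is hypothetical; it assumes users, not orders, are the independent units.

```python
import random
from statistics import fmean

random.seed(1)
# Simulated user-level totals: (revenue, number_of_orders) per user.
users = []
for _ in range(2000):
    n_orders = random.randint(1, 6)
    revenue = sum(random.gauss(25.0, 8.0) for _ in range(n_orders))
    users.append((revenue, n_orders))


def ratio_with_delta_se(pairs):
    """Ratio of means and its delta-method standard error over independent users."""
    n = len(pairs)
    num = [p[0] for p in pairs]
    den = [p[1] for p in pairs]
    mu_n, mu_d = fmean(num), fmean(den)
    r = mu_n / mu_d
    var_n = sum((a - mu_n) ** 2 for a in num) / (n - 1)
    var_d = sum((b - mu_d) ** 2 for b in den) / (n - 1)
    cov_nd = sum((a - mu_n) * (b - mu_d) for a, b in zip(num, den)) / (n - 1)
    # First-order Taylor expansion of N/D around (mu_n, mu_d).
    var_r = (var_n - 2 * r * cov_nd + r ** 2 * var_d) / (mu_d ** 2 * n)
    return r, var_r ** 0.5


ratio, se = ratio_with_delta_se(users)
print(f"revenue per order = {ratio:.2f} +/- {1.96 * se:.2f}")
```

The covariance term is what an order-level calculation would miss: users with more orders also have more revenue, so numerator and denominator move together.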
Multiple metrics create multiple testing risk. DoorDash experiments usually have one primary metric, guardrails such as refund_rate, lateness, and Dasher_pay, and diagnostic secondary metrics. Use hierarchy, pre-registration, or corrections like Bonferroni or Benjamini-Hochberg when many claims are made.
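The two corrections named above differ in how conservative they are. The following sketch applies both to a made-up list of p-values (not from any real experiment); the function names are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            reject[idx] = True
    return reject


# Made-up p-values for eight secondary metrics.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print(sum(bonf), sum(bh))
```

On this list Bonferroni keeps one finding and Benjamini-Hochberg keeps two, while an uncorrected 5% threshold would have kept five. That gap is why a single pre-specified primary metric is usually preferable to correcting many claims after the fact.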
Duration must cover temporal cycles. Even if sample size is reached in two days, marketplace metrics often vary by weekday, lunch/dinner peaks, weather, holidays, and promotions. A credible design often runs at least one or two full weekly cycles unless the intervention is extremely short-lived.

Worked example

For the question “Design and analyze a switchback experiment,” a strong candidate would start by clarifying the intervention, the marketplace side affected, the primary metric, and the likely interference pattern: “Are we changing dispatch logic, pricing, batching, or Dasher incentives, and does treatment spill over across nearby zones or time windows?” They would state that user-level randomization may be invalid if supply is shared, so they would propose randomizing market-time blocks, such as zone_id × hour or market × meal_period. The answer would be organized around four pillars: define success metrics and guardrails, choose the randomization unit, calculate power/MDE using block-level variance, and specify the analysis model.

For the model, they might propose a regression like $Y_{zt}=\alpha_z+\gamma_t+\tau T_{zt}+\epsilon_{zt}$, where $Y_{zt}$ is the outcome in zone $z$ and time block $t$, $\alpha_z$ are zone fixed effects, $\gamma_t$ are time fixed effects, $T_{zt}$ is the treatment indicator, and $\tau$ is the treatment effect of interest. Standard errors are clustered by zone or switchback block, depending on how treatment was assigned. They would explain that power comes from the number of randomized blocks and the within-zone over-time variation, not merely the number of deliveries. A key tradeoff is granularity: shorter blocks increase sample size but raise carryover risk if Dashers reposition or orders started in one block finish in another. They would close by saying that, with more time, they would simulate power from historical delivery_time, fulfillment_rate, and order_volume data to account for skew, seasonality, and cluster correlation instead of relying only on a normal approximation.
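A simulation-based power check along these lines might look like the sketch below. It is heavily simplified and rests on stated assumptions: block-level outcomes are drawn from a normal distribution rather than resampled from historical data, zone and time effects are assumed to be already removed, carryover is ignored, and the effect size and block-level standard deviation are invented. The function name `simulate_power` is hypothetical.

```python
import random
from statistics import NormalDist, fmean, variance


def simulate_power(n_blocks, effect, block_sd, n_sims=500, alpha=0.05, seed=0):
    """Monte Carlo power for a block-randomized test analyzed at the block level."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        treated, control = [], []
        for _ in range(n_blocks):
            # Each zone-time block is assigned independently with probability 0.5.
            is_treated = rng.random() < 0.5
            y = rng.gauss(0.0, block_sd) + (effect if is_treated else 0.0)
            (treated if is_treated else control).append(y)
        if len(treated) < 2 or len(control) < 2:
            continue  # degenerate assignment, counted as a non-rejection
        se = (variance(treated) / len(treated)
              + variance(control) / len(control)) ** 0.5
        z = (fmean(treated) - fmean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


# Invented inputs: a 0.5-minute delivery_time effect, block-level SD of 2 minutes.
power_200 = simulate_power(n_blocks=200, effect=0.5, block_sd=2.0)
power_800 = simulate_power(n_blocks=800, effect=0.5, block_sd=2.0)
print(power_200, power_800)
```

Note that the inputs are the number of blocks and the block-level standard deviation; the number of deliveries inside each block never appears. In practice, the normal draws would be replaced by resampled historical block-level outcomes so that skew and seasonality carry through.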

A second angle

For the question “Analyze Retention Data for Geo-Targeted Feature Launch,” the same power logic applies, but the constraint shifts from time-block randomization to a small number of geographies. The effective sample size is closer to the number of markets than the number of consumers, especially if local marketing, merchant mix, or Dasher supply shocks affect everyone in a region. A strong design would use matched geo pairs, pre-period retention trends, and possibly a difference-in-differences estimator to reduce variance. Power should be based on historical market-level retention volatility, not individual-level binomial variance. The candidate should also discuss that retention curves create multiple endpoints, such as D7_retention, D28_retention, and repeat-order rate, so the primary endpoint must be chosen before launch.

Common pitfalls

Pitfall: Treating rows as independent observations.

A tempting answer is, “We have millions of orders, so the experiment will be highly powered.” That fails in switchback, geo, courier, and marketplace tests because the independent unit may be a zone-hour, market, Dasher, or user cohort. A better answer explicitly identifies the randomization unit and computes power using variance at that level.

Pitfall: Starting with the formula before defining the decision.

Interviewers do not want a memorized sample-size equation detached from the business question. If the candidate cannot say what MDE matters for revenue, retention, delivery_quality, or Dasher_experience, the math is not useful. Start with the decision the experiment must support, then translate that into an effect size and duration.

Pitfall: Ignoring dilution and noncompliance.

For notification experiments or courier initiatives, not everyone assigned to treatment will actually receive or comply with the treatment. Reporting only a per-exposed lift can exaggerate impact if exposure is post-treatment. A stronger answer distinguishes intent-to-treat, treatment-on-the-treated, eligibility criteria, and how each affects power and causal interpretation.

Connections

Interviewers often pivot from power analysis into causal inference, especially difference-in-differences, instrumental variables, or synthetic control for geo launches. They may also probe metric design, experiment guardrails, sequential testing, variance reduction, or heterogeneous treatment effects across markets, merchants, consumers, and Dashers.
