Evaluate Biker Feature Success
Company: DoorDash
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: Hard
Interview Round: Onsite
DoorDash is considering launching **Biker Mode**, a feature for Dashers who deliver by bicycle. Biker Mode may help bicycle Dashers identify suitable short-distance orders, receive bike-friendly routing, and get matched to deliveries where bicycles are operationally efficient.
You are the data scientist owning the launch. Product, operations, and the dispatch team want a single recommendation — **launch, iterate, or roll back** — backed by a measurement plan they can trust. Work through the five questions in the Parts below.
### Constraints & Assumptions
- DoorDash is a **two-sided marketplace**: a finite pool of Dashers (bike and car) is matched to incoming orders by a dispatch/assignment system. Biker Mode plausibly touches **eligibility, routing, and matching** (which Dasher gets which order), not just a Dasher-facing UI. State your assumption explicitly if you proceed differently.
- The feature is meaningful only for **short-distance orders in dense zones** with available bike supply (urban cores). Most DoorDash orders are *not* bike-eligible.
- **Scale.** DoorDash operates across thousands of zones; a given dense zone sees on the order of hundreds to low-thousands of orders per day. You may pick reasonable numbers and label them as assumptions.
- You have standard marketplace telemetry: order-level delivery times, offer/assignment logs, Dasher online/active hours, earnings, cancellations, ratings, and per-zone/per-hour supply-demand data.
- You do **not** need a final powered sample size, but you should be able to sketch how you would size the test and what inflates that number.
### Clarifying Questions to Ask
- Does Biker Mode change dispatch/assignment (who gets which order), or is it purely a UI + routing layer the individual Dasher sees? This determines whether interference is a first-order concern.
- Is the business goal **incremental throughput** (more deliveries completed), or **substitution** toward a cheaper/greener mode at equal throughput? This changes the primary metric.
- What is the intended eligible population — which zones, distance thresholds, and dayparts count as "bike-suitable"?
- Is there meaningful cross-zone delivery, or are zones effectively self-contained for matching purposes?
- Are Dasher safety and earnings already instrumented well enough to serve as guardrails?
- What is the rollout horizon and appetite for risk — a quick read or a durable, multi-week effect?
### Part 1 — Define success
How would you define success for the Biker Mode feature? Give a single, defensible success statement and say explicitly what it is deliberately *not* counting as success.
```hint What to be suspicious of
Adoption (opt-in rate, orders done by bike) is a *leading indicator*, not success. Ask whether a metric could rise while the overall marketplace is no better off.
```
```hint Whose value are you counting?
Anchor the statement on net value *per unit of supply* across the whole local marketplace. Make sure it would reject a result that merely reshuffles work between Dasher types while total throughput and customer reliability stay flat.
```
#### What a Strong Answer Covers
- Frames success as **net marketplace value per unit of supply** (e.g. completed deliveries or contribution profit per active supply-hour), not bike volume.
- Explicitly distinguishes **adoption from success** and names what is *not* counted (opt-in rate, raw bike order count, a lift that is really a zero-sum transfer).
- Bakes the **anti-cannibalization** condition into the statement — it should fail a result that merely moves orders from car to bike Dashers at flat total throughput and flat customer reliability.
### Part 2 — Metrics
What **primary**, **secondary**, and **guardrail** metrics would you track? Be explicit about why each metric sits in its tier.
```hint Keep the primary singular
A single primary metric avoids multiple-comparison gaming. Prefer a *ratio* that captures productive use of supply and that cannot be inflated by simply doing more low-value trips.
```
```hint Guardrails are multi-stakeholder
A marketplace feature can win for one side and lose for another. Think about who could be hurt — customers, bike Dashers, **car Dashers**, merchants, the marketplace overall — and give each at least one guardrail that can veto a launch. Which single guardrail catches the zero-sum cannibalization failure?
```
#### What a Strong Answer Covers
- One **singular primary** that is a gaming-resistant ratio (productive use of supply), with the throughput-vs-substitution framing reflected in the choice.
- A coherent **secondary/diagnostic** set that would *explain why the primary moved* (funnel, utilization, match quality), not just more outcome metrics.
- **Multi-stakeholder guardrails** (customer, bike Dasher, car Dasher, merchant, marketplace), each able to veto a launch, including an explicit **total-throughput / car-Dasher cannibalization** guardrail.
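
As one way to make the tiers concrete, the sketch below computes a candidate primary metric (completed deliveries per active supply-hour) and a few guardrails from zone-hour rows. The `ZoneHour` fields and the guardrail names are hypothetical, chosen for illustration; they are not DoorDash's actual schema or metric definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZoneHour:
    """One zone-hour of telemetry. All field names are assumptions."""
    completed_deliveries: int
    active_supply_hours: float  # bike + car Dasher active hours combined
    late_deliveries: int
    car_dasher_earnings: float
    car_active_hours: float


def deliveries_per_supply_hour(rows):
    """Candidate primary metric: a ratio of sums across all rows.

    Summing first keeps busy zone-hours from being down-weighted, which a
    mean of per-row ratios would do.
    """
    deliveries = sum(r.completed_deliveries for r in rows)
    hours = sum(r.active_supply_hours for r in rows)
    return deliveries / hours if hours else float("nan")


def guardrails(rows):
    """A few guardrails, each covering a different stakeholder."""
    deliveries = sum(r.completed_deliveries for r in rows)
    late = sum(r.late_deliveries for r in rows)
    car_earnings = sum(r.car_dasher_earnings for r in rows)
    car_hours = sum(r.car_active_hours for r in rows)
    return {
        # Marketplace: catches a zero-sum transfer at flat throughput.
        "total_completed_deliveries": deliveries,
        # Customer: reliability must not degrade.
        "late_rate": late / deliveries if deliveries else float("nan"),
        # Car Dashers: the group most exposed to cannibalization.
        "car_earnings_per_active_hour": (
            car_earnings / car_hours if car_hours else float("nan")
        ),
    }


rows = [
    ZoneHour(30, 12.0, 3, 200.0, 8.0),
    ZoneHour(50, 20.0, 5, 300.0, 12.0),
]
primary = deliveries_per_supply_hour(rows)
checks = guardrails(rows)
```

Comparing these values between arms, rather than reading them in isolation, is what turns them into a primary read and a set of vetoes.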
### Part 3 — Experiment design
How would you design an experiment to estimate the **causal** impact of Biker Mode? Cover the eligible population, the unit of randomization, your hypotheses, the estimator, and how you would size and time the test.
```hint The decision that drives everything
The unit of randomization is the crux. Ask: if a treated bike Dasher takes an order, does that *change* what a control Dasher experiences? If yes, a naive Dasher- or order-level A/B violates SUTVA and the estimate is biased.
```
```hint Make the interference happen inside a unit
If a naive split is contaminated, you need a randomization unit big enough that the interference is contained *within* one unit instead of leaking across arms. What is the smallest unit that still satisfies that? What new artifact appears at the boundary between consecutive on/off periods if you toggle the same place over time, and what does a bigger unit cost you in independent samples and power?
```
```hint Estimate on the right population
Testing across *all* orders dilutes a real effect until it is undetectable, so define and analyze the eligible population only. Then ask: should your headline number be computed over everyone *assigned* to treatment, or only over those who actually *used* the feature, and which of those two choices quietly lets the keenest Dashers select themselves in? Finally, once your unit of randomization is a whole place-and-time block rather than one order, are the orders inside a block independent, and what does that do to the sample size a textbook formula hands you?
```
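
The last hint points at analyzing the data at the level of the randomized block. A minimal sketch of that idea, assuming a switchback-style design in which each zone-by-time-window block is assigned to one arm: collapse every block to a single rate, then compare block means by assignment (an intent-to-treat read). The tuple layout and the function name `block_level_itt` are hypothetical, and treating each block as one equally weighted observation is a simplification.

```python
import math
from statistics import mean, variance


def block_level_itt(blocks):
    """Difference in mean block-level rates, by assigned arm.

    blocks: iterable of (arm, completed_deliveries, active_supply_hours),
    one tuple per randomized zone-window block. Orders inside a block are
    never treated as independent observations; the block is the unit.
    """
    rates = {"treatment": [], "control": []}
    for arm, deliveries, hours in blocks:
        rates[arm].append(deliveries / hours)

    treated, control = rates["treatment"], rates["control"]
    effect = mean(treated) - mean(control)
    # Standard error from between-block variance (unequal variances allowed).
    se = math.sqrt(
        variance(treated) / len(treated) + variance(control) / len(control)
    )
    return effect, se


# Illustrative numbers only, not real data.
blocks = [
    ("treatment", 30, 10.0),
    ("treatment", 26, 10.0),
    ("control", 24, 10.0),
    ("control", 20, 10.0),
]
effect, se = block_level_itt(blocks)
```

Because assignment, not usage, defines the arms, Dashers who never open the feature still count toward treatment, which is what keeps self-selection out of the headline number.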
#### What a Strong Answer Covers
- Defines and analyzes the **eligible population** (dense zones, short distances, bike-supply dayparts) rather than testing across all orders.
- Correctly identifies the **interference / SUTVA** problem and justifies a randomization unit that *contains* the interference (switchback or zone cluster), instead of defaulting to a Dasher/order-level A/B.
- States clean **hypotheses and an estimand**, defaults to **ITT** with TOT/CACE as a secondary read, and uses **CUPED** or pre-period covariates for variance reduction.
- Computes **standard errors at the level at which randomization occurred** (cluster-robust) and sizes the test with an **MDE**, inflating the naive count by the **design effect** from clustering.
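
The sizing logic in the last bullet can be sketched with the standard two-sample formula and the usual design effect, $1 + (m - 1)\rho$, where $m$ is the number of orders per block and $\rho$ is the intra-cluster correlation. Every input below (outcome standard deviation, MDE, block size, ICC) is an assumed placeholder, not a DoorDash figure.

```python
import math
from statistics import NormalDist


def naive_n_per_arm(sigma, mde, alpha=0.05, power=0.8):
    """Orders per arm for a two-sided test, assuming independent orders."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sigma**2 / mde**2


def design_effect(orders_per_block, icc):
    """Variance inflation from clustering: 1 + (m - 1) * rho."""
    return 1 + (orders_per_block - 1) * icc


def blocks_per_arm(sigma, mde, orders_per_block, icc, alpha=0.05, power=0.8):
    """Randomized blocks needed per arm after inflating for clustering."""
    n_orders = naive_n_per_arm(sigma, mde, alpha, power) * design_effect(
        orders_per_block, icc
    )
    return math.ceil(n_orders / orders_per_block)


# Assumed inputs: sd of 10 minutes, MDE of 1 minute, 200 orders per block,
# and an intra-cluster correlation of 0.05.
n_naive = naive_n_per_arm(sigma=10.0, mde=1.0)
deff = design_effect(orders_per_block=200, icc=0.05)
n_blocks = blocks_per_arm(sigma=10.0, mde=1.0, orders_per_block=200, icc=0.05)
```

Under these assumptions even a modest ICC inflates the required order count roughly elevenfold, which is why larger randomization units cost power. Variance reduction pulls the other way: a CUPED adjustment whose pre-period covariate correlates with the outcome at $r$ scales the variance by about $1 - r^2$.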
### Part 4 — Biases, confounders, interference, edge cases
What biases, confounders, marketplace-interference issues, or edge cases would you watch for, and how would you mitigate each?
```hint Walk the pipeline, not a checklist
Don't just list textbook biases. Trace one order from *who self-selects into the feature*, through *what the design itself introduces*, to *what you do at analysis time with many metrics and repeated looks* — a distinct threat hides at each stage. For every one you name, attach a concrete mitigation in the design or analysis; a threat without a fix earns little credit.
```
#### What a Strong Answer Covers
- A **threats inventory that spans the pipeline**: selection/opt-in bias, design-induced threats (interference, switchback carryover, weather/seasonality, mix-shift/Simpson's paradox), and analysis-time threats (peeking, multiplicity, noncompliance).
- A **concrete mitigation paired with each threat**, located in the design or the analysis — not a bare list of names.
- Explicit treatment of **cannibalization / spillover** as a first-order marketplace risk, not just a generic confounder.
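
Mix-shift is the easiest of these threats to show with numbers. In the sketch below the per-cell rate is identical in both arms, yet the aggregate rate rises because treatment moves supply-hours toward the cell with the higher rate. Standardizing to the control mix removes the artifact. The cell names and all figures are made up for illustration.

```python
def aggregate_rate(cells):
    """Overall deliveries per supply-hour: ratio of sums across cells."""
    deliveries = sum(d for d, _ in cells.values())
    hours = sum(h for _, h in cells.values())
    return deliveries / hours


def standardized_rate(cells, reference):
    """Per-cell rates from `cells`, weighted by the `reference` hour mix."""
    ref_hours = sum(h for _, h in reference.values())
    return sum(
        (cells[name][0] / cells[name][1]) * (reference[name][1] / ref_hours)
        for name in reference
    )


# cell -> (completed_deliveries, active_supply_hours); illustrative only.
control = {"dense_core": (300, 100.0), "outer_ring": (200, 100.0)}
treatment = {"dense_core": (450, 150.0), "outer_ring": (100, 50.0)}

raw_control = aggregate_rate(control)                  # 2.5
raw_treatment = aggregate_rate(treatment)              # 2.75
adjusted = standardized_rate(treatment, control)       # 2.5
```

Here the raw comparison suggests a lift of 0.25 deliveries per supply-hour, while the mix-adjusted comparison shows none. Whether the shift in mix is itself a benefit or just relocation is a separate question for the throughput guardrail.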
### Part 5 — Launch / iterate / roll-back decision
How would you decide whether to **launch**, **iterate**, or **roll back** the feature based on the results?
```hint More than a p-value
A defensible rule combines statistical significance, *practical* significance (a pre-registered MDE), guardrail status, and operational feasibility. Map concrete result patterns — clean win, segment-specific or fair-weather-only win, low-adoption-but-good-adopters, cannibalization, reliability degradation — to each of the three actions.
```
#### What a Strong Answer Covers
- A decision rule combining **statistical significance + practical significance (pre-registered MDE) + guardrail status + operational feasibility**, not p-values alone.
- A **mapping of concrete result patterns to actions** (clean win → launch; segment- or weather-specific win or a single fixable guardrail → iterate / gated ship; cannibalization or reliability degradation → roll back).
- A **staged post-launch plan** (ramp with a holdout, continued guardrail monitoring) acknowledging that some effects only surface over weeks.
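
One way to pre-register the rule above is to write it down as a function before the data arrives. The sketch below is a hypothetical encoding: the `Readout` fields, the guardrail names, and the choice of which guardrails force a roll back are all assumptions made for illustration, and a real rule would be agreed with product and operations.

```python
from dataclasses import dataclass

# Assumption: breaching either of these forces a roll back outright.
HARD_GUARDRAILS = frozenset({"total_throughput", "customer_reliability"})


@dataclass(frozen=True)
class Readout:
    """Summary of an experiment readout. Field names are hypothetical."""
    point_estimate: float  # lift in the primary metric
    ci_low: float          # lower bound of the confidence interval
    ci_high: float         # upper bound of the confidence interval
    mde: float             # pre-registered practical-significance bar
    breached_guardrails: frozenset = frozenset()
    operationally_feasible: bool = True
    win_is_segment_specific: bool = False


def decide(r):
    """Map a readout to 'launch', 'iterate', or 'roll back'."""
    # Cannibalization, reliability damage, or a significantly negative
    # primary effect all end the launch.
    if r.breached_guardrails & HARD_GUARDRAILS or r.ci_high < 0:
        return "roll back"
    clean_win = (
        r.ci_low > 0                      # statistically significant
        and r.point_estimate >= r.mde     # practically significant
        and not r.breached_guardrails     # no guardrail breached at all
        and r.operationally_feasible
        and not r.win_is_segment_specific
    )
    # Everything else (segment-only win, a soft guardrail, a null) iterates.
    return "launch" if clean_win else "iterate"


clean = decide(Readout(0.3, 0.1, 0.5, 0.2))
cannibalized = decide(
    Readout(0.3, 0.1, 0.5, 0.2, breached_guardrails=frozenset({"total_throughput"}))
)
segment_only = decide(Readout(0.3, 0.1, 0.5, 0.2, win_is_segment_specific=True))
```

Fixing the rule in advance means a positive primary metric cannot override a breached throughput or reliability guardrail after the fact.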
### What a Strong Answer Covers
These dimensions span all five Parts:
- Treats the problem as a **two-sided marketplace** question end to end — every Part reflects that adding or steering bike supply must improve the whole local marketplace, not just bike activity.
- Maintains internal consistency: the success statement (Part 1), the primary metric and cannibalization guardrail (Part 2), the interference-aware design (Part 3), and the decision rule (Part 5) all point at the same notion of net value per unit of supply.
- Communicates assumptions and trade-offs explicitly, stating where a clarification (UI-only vs. dispatch-touching; throughput vs. substitution) would change the design.
### Follow-up Questions
- Your switchback shows a positive primary effect, but car-Dasher earnings drop in the same zones while total throughput is flat. What do you conclude and recommend?
- The aggregate primary metric improves, yet within every individual zone-hour cell it is flat. What is happening, and which number do you trust?
- How would you detect and quantify spillover onto neighboring zones if cross-zone deliveries exist?
- The effect is positive only in fair weather and only downtown. Is that a launch, an iterate, or a roll-back, and how would you ship it?
- After launch, which of your metrics would you expect to take weeks (not days) to stabilize, and how would you keep monitoring them?
Quick Answer: This question evaluates experimentation and product analytics skills—metric design, causal inference, and marketplace impact assessment—in the context of a two-sided delivery marketplace, with attention to eligibility, routing, and matching effects.