Evaluate a new ranking model
Company: DoorDash
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: hard
Interview Round: Onsite
A food-delivery company serves homepage store recommendations with ranking model **V1.1**. A new model **V2.0** adds several new features and may require a different feature-set configuration for treatment users.
Design an **experimentation and rollout plan** for this model upgrade. This is a two-sided marketplace: changing what the homepage shows can shift consumer demand, merchant exposure, courier utilization, and delivery ETAs — so the plan must combine product metric design, causal inference (interference / SUTVA), and operational safety, not just an A/B test on clicks.
The question is broken into seven parts. Treat them as one coherent plan: the metric, randomization, infrastructure, logging, validity threats, statistics, and launch criteria should all hang together.
### Constraints & Assumptions
- Two-sided marketplace: recommendations affect merchant demand, courier load, and delivery times, so one user's treatment can affect another user's experience (interference).
- The eligible candidate pool is constrained: a store must be **in delivery range** and **currently open** to be shown — and that eligible set changes by time and location.
- V2.0 may depend on additional features, including possibly real-time features, that V1.1 did not use. Treatment must be able to fetch a different feature bundle than control.
- Homepage serving is latency-sensitive (per-retrieval-path budgets in the single- to double-digit millisecond range), so any added feature computation has a latency cost.
- Assume meaningful but finite traffic — variance reduction and power planning matter; you cannot run forever.
### Clarifying Questions to Ask
- What is the company's true north — short-term orders/GMV, contribution margin, or long-term retention? This determines the primary metric.
- How large a lift do we need V2.0 to deliver to justify the added infra complexity and any latency cost (i.e., what is the practically significant effect)?
- How material is interference expected to be — does V2.0 mostly re-rank the same eligible stores, or does it change *which* stores get demand enough to move ETAs and supply?
- What feature SLAs and freshness guarantees exist, and what is the current feature-missingness/timeout rate at serving time?
- What is the baseline homepage-session→order conversion rate and current daily homepage traffic (needed for power/MDE)?
- Are there existing experimentation primitives — a bucketing service, config/feature-flag system, switchback tooling — we must build on or around?
### Part 1 — Primary success metric and guardrail metrics
Define the **primary success metric** and the important **guardrail metrics** for a homepage recommendation model in a two-sided delivery marketplace. Justify the primary metric over naive alternatives, and explain why guardrails are non-negotiable here.
```hint Where to start
Start from business value, not engagement. Ask: what action on the homepage actually creates marketplace value? Then ask what that optimization could quietly *break* on the supply/operations side.
```
```hint Pitfall to name
Explain why CTR alone is a poor primary metric (noisy, gameable, a model can raise clicks while lowering real orders), and pick a metric closer to value (e.g. orders or GMV per session). Guardrails should cover both the consumer-latency path and the marketplace/operations side (ETA, cancellations, merchant fairness).
```
#### What This Part Should Cover
- A single, business-aligned **primary** metric (e.g. orders or GMV/contribution-margin per session) with an explicit argument for why it beats CTR.
- **Guardrail** metrics spanning serving health (p95/p99 latency, timeout/error rate) AND marketplace health (delivery ETA, cancellation/refund rate, merchant-exposure concentration/fairness).
- Recognition that maximizing immediate orders can degrade ETAs, courier load balancing, merchant fairness, and long-term supply diversity.
- A short layer of secondary/diagnostic metrics (CTR, add-to-cart, reorder, basket/AOV, new-store discovery, retention) used to interpret, not decide.
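
As a rough illustration of the primary metric and one marketplace guardrail, the sketch below computes orders per session and an exposure-concentration score. The record layout and function names are hypothetical, and the Herfindahl-Hirschman index (HHI) is only one possible way to quantify merchant-exposure concentration; the question does not prescribe a specific fairness metric.

```python
from collections import Counter

def orders_per_session(sessions):
    """Primary metric: total orders divided by the number of homepage sessions."""
    if not sessions:
        return 0.0
    return sum(s["orders"] for s in sessions) / len(sessions)

def exposure_hhi(impressions):
    """Sum of squared exposure shares across stores.

    Equals 1/n when exposure is spread evenly over n stores and 1.0 when a
    single store receives every impression, so higher means more concentrated.
    """
    counts = Counter(impressions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) ** 2 for c in counts.values())

# Hypothetical toy data: four sessions, four homepage impressions.
sessions = [{"orders": 1}, {"orders": 0}, {"orders": 0}, {"orders": 1}]
impressions = ["store_a", "store_a", "store_b", "store_c"]

print(orders_per_session(sessions))  # 0.5
print(exposure_hhi(impressions))     # 0.375
```

Comparing the concentration score between arms, rather than reading its absolute level, is what would make it usable as a guardrail.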
### Part 2 — Unit of randomization
Choose the **unit of randomization** — user-level, session-level, geo-level, or switchback/time-based — given that recommendations can affect merchant demand, delivery times, and marketplace balance. State your default and the condition under which you'd switch.
```hint Key tension
This is fundamentally a bias-variance / SUTVA tradeoff. Finer units (user/session) give power but can violate the assumption that one unit's treatment doesn't affect another's outcome; coarser units (geo, switchback) contain interference but cost power.
```
```hint Technique to surface
Name **switchback / geo-time clustered** designs as the interference-robust option used in delivery/ride-sharing, and tie the *choice* to how much V2.0 actually moves marketplace allocation rather than picking one dogmatically.
```
#### What This Part Should Cover
- Explicit reasoning about **SUTVA / interference**: why user-level A/B can be biased when treatment changes which stores get demand.
- A comparison of the options with honest pros/cons (power vs. interference-containment; session-level contamination across variants).
- A **decision rule**: user-level when spillovers are small (modest re-rank of eligible stores); geo-time switchback or zone-clustered when the ranker materially shifts marketplace dynamics.
- Awareness of the power cost of clustering (fewer independent units, geo heterogeneity).
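
A minimal sketch of geo-time switchback assignment, assuming hypothetical zone IDs, an illustrative experiment name, and a 60-minute window: every request in the same (zone, time window) cell gets the same arm, which is what contains interference inside the cell.

```python
import hashlib
from datetime import datetime, timezone

def switchback_arm(zone_id, ts, experiment_id="homepage_ranker_v2", window_minutes=60):
    """Assign a whole (zone, time-window) cell to one arm via a stable hash.

    `experiment_id` and `window_minutes` are illustrative defaults, not values
    taken from the question.
    """
    window_index = int(ts.timestamp()) // (window_minutes * 60)
    key = f"{experiment_id}:{zone_id}:{window_index}".encode()
    digest = hashlib.sha256(key).digest()
    return "treatment" if digest[0] % 2 == 0 else "control"

# Two requests in the same zone and the same hour share an arm.
t1 = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 1, 12, 55, tzinfo=timezone.utc)
print(switchback_arm("zone_17", t1) == switchback_arm("zone_17", t2))  # True
```

Pure hashing does not guarantee balance across day-parts within a zone, so a real design might instead randomize within blocks (for example, per zone per day) and leave a washout period between switches.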
### Part 3 — Serving infrastructure for experiment-specific versions and feature configs
Describe how the serving infrastructure should support **experiment-specific model versions and feature-set configuration**, so control and treatment can safely fetch different feature lists. Show how you keep this safe and reproducible.
```hint Decompose the system
Separate three concerns: (1) deterministic assignment, (2) a config/registry that maps an arm to {model version, feature bundle}, and (3) safe handling when a treatment-only feature is missing at serve time.
```
```hint Safety mechanism
Before any live exposure, how do you de-risk latency and feature failures without affecting users? Think about computing V2.0 outputs without serving them.
```
#### What This Part Should Cover
- **Deterministic bucketing** (stable hash of user_id or geo-time bucket) that logs experiment ID and arm — no per-request flapping.
- An **experiment config service** mapping arm → model version + feature bundle, so features are not hard-coded in the app (control: v1.1/bundle A; treatment: v2.0/bundle B).
- A **versioned feature registry** (schema, types, freshness SLA, defaults, owners, optional/deprecated flags) and **backward-compatible serving** (safe defaults + a missingness indicator, never hard serving failures).
- **Shadow mode** before live: compute V2.0 scores in parallel to compare latency, score distribution, missingness, and calibration.
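
The three concerns above (deterministic assignment, an arm-to-config mapping, and safe handling of missing features) can be sketched as follows. All names here, including the feature names, defaults, and bundle labels, are hypothetical placeholders rather than an actual serving API.

```python
import hashlib

# Hypothetical arm -> serving config mapping, owned by a config service.
EXPERIMENT_CONFIG = {
    "control": {"model_version": "v1.1", "feature_bundle": "bundle_a"},
    "treatment": {"model_version": "v2.0", "feature_bundle": "bundle_b"},
}

# Hypothetical registry: feature bundle -> {feature name: safe default}.
FEATURE_REGISTRY = {
    "bundle_a": {"store_rating": 4.0, "distance_km": 5.0},
    "bundle_b": {"store_rating": 4.0, "distance_km": 5.0, "live_prep_time_min": 15.0},
}

def assign_arm(experiment_id, unit_id, treatment_share=0.5):
    """Stable hash of (experiment, unit): the same unit always gets the same arm."""
    digest = hashlib.sha256(f"{experiment_id}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

def resolve_features(arm, fetched):
    """Fill missing features with registry defaults and flag what was missing.

    Returns (values, missing) so the model can consume a missingness indicator
    instead of the request failing outright.
    """
    bundle = FEATURE_REGISTRY[EXPERIMENT_CONFIG[arm]["feature_bundle"]]
    values, missing = {}, {}
    for name, default in bundle.items():
        present = fetched.get(name) is not None
        values[name] = fetched[name] if present else default
        missing[name] = not present
    return values, missing

# A treatment request where the real-time feature timed out.
values, missing = resolve_features("treatment", {"store_rating": 4.6, "distance_km": 1.2})
print(values["live_prep_time_min"], missing["live_prep_time_min"])  # 15.0 True
```

Salting the hash with the experiment ID keeps assignments independent across experiments, and keeping defaults in the registry means a new treatment-only feature never requires an application code change.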
### Part 4 — Logging requirements
Specify what **events and metadata** must be logged so the experiment can be analyzed correctly and reproducibly.
```hint What to anchor on
The test is: could you exactly reconstruct what each user saw and why? Log enough to attribute outcomes to an arm, a model version, and a specific candidate list — plus the diagnostics that explain validity threats later.
```
#### What This Part Should Cover
- Assignment-level fields: experiment ID, arm, unit ID (user/session/geo-time bucket), timestamp + timezone, **model version**, **feature-config version**.
- Request/ranking-level fields: candidate set before ranking, ranked list shown, per-candidate scores where feasible, **feature-missingness/freshness indicators**, per-component latency.
- Marketplace/ranking diagnostics: eligible/serviceable pool size, fallback-triggered flag, stale-feature usage, store-level exposure.
- Downstream outcomes with attribution window: click, add-to-cart, order, basket size, cancellation.
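
One way to make the fields above concrete is a single exposure record written at ranking time. The schema below is a hypothetical sketch; field names and example values are assumptions, and downstream outcomes (click, order, cancellation) would be joined later on `unit_id` and timestamp within the attribution window.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Dict, List

@dataclass
class ExposureLog:
    # Assignment-level fields
    experiment_id: str
    arm: str
    unit_id: str
    timestamp_utc: str
    model_version: str
    feature_config_version: str
    # Request/ranking-level fields
    candidate_store_ids: List[str]
    ranked_store_ids: List[str]
    scores: Dict[str, float] = field(default_factory=dict)
    missing_features: List[str] = field(default_factory=list)
    latency_ms: Dict[str, float] = field(default_factory=dict)
    # Marketplace diagnostics
    eligible_pool_size: int = 0
    fallback_triggered: bool = False

record = ExposureLog(
    experiment_id="homepage_ranker_v2",
    arm="treatment",
    unit_id="user_42",
    timestamp_utc="2024-01-01T12:05:00Z",
    model_version="v2.0",
    feature_config_version="bundle_b@3",
    candidate_store_ids=["s1", "s2", "s3"],
    ranked_store_ids=["s2", "s1", "s3"],
    scores={"s1": 0.41, "s2": 0.57, "s3": 0.12},
    missing_features=["live_prep_time_min"],
    latency_ms={"feature_fetch": 6.0, "inference": 4.5},
    eligible_pool_size=3,
)

# One JSON line per ranking request.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```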
### Part 5 — Practical validity threats
Explain how to handle **sample ratio mismatch, delayed conversions, feature missingness, novelty effects, selection bias, and spillover/interference**. For each, give a concrete diagnostic or mitigation.
```hint Triage order
Check experiment *integrity* before reading any lift — assignment/logging health first. Then handle timing (delayed conversions), then biases that conflate infra quality or eligibility shifts with model quality, then interference.
```
```hint The subtle ones
For feature missingness, ask whether you're measuring the model or the infrastructure (if treatment has more missing real-time features). For selection bias, remember the eligible store set shifts by time/place. For interference, connect back to your Part 2 randomization choice.
```
#### What This Part Should Cover
- **SRM**: define it (planned vs observed split), what it signals (assignment/logging bugs, treatment-induced crashes, geo routing), and that you do not trust lift until it's resolved.
- **Delayed conversions**: define an attribution window and an analysis window, and explain why reading results before the attribution window closes undercounts late orders and biases toward click-heavy variants.
- **Feature missingness**: missingness indicators, freshness logging, and slicing by missing vs non-missing so you don't confuse infra quality with model quality.
- **Novelty effects, selection bias, spillover/interference**: monitor for novelty decay over time; log eligible-pool composition so eligibility shifts aren't read as lift; mitigate interference with switchback/zone clustering and marketplace-level outcome monitoring.
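
The SRM check can be sketched as a chi-square goodness-of-fit test on arm counts. The counts below are made up, and the 0.001 alert threshold is a commonly used convention rather than something the question specifies. With two arms the test has one degree of freedom, so the p-value can be computed with the standard library alone.

```python
import math

def srm_check(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square test of observed arm counts against the planned split.

    Returns (chi2, p_value). For 1 degree of freedom the chi-square survival
    function is erfc(sqrt(chi2 / 2)).
    """
    total = n_control + n_treatment
    exp_treatment = total * expected_treatment_share
    exp_control = total - exp_treatment
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts for a planned 50/50 split.
chi2, p_value = srm_check(50_000, 51_200)
if p_value < 0.001:  # assumed alert threshold
    print(f"SRM suspected (chi2={chi2:.2f}); investigate before reading lift")
```

Running the same check per platform, per day, and per geo often localizes the cause, for example a treatment-only crash on one client version.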
### Part 6 — Power / MDE and variance reduction
Explain how to estimate **power / MDE**, and when **stratification** or **CUPED** would help. Show the quantitative reasoning, including how clustering changes the math.
```hint Formula to reach for
For a binary metric, relate sample size to baseline rate $p$ and absolute detectable lift $\delta$ via the standard two-proportion sample-size approximation, then adjust for clustered designs with a **design effect** based on cluster size and intra-cluster correlation.
```
```hint Variance reduction
CUPED uses a pre-period covariate correlated with the outcome (e.g. prior order count) to subtract predictable variance. Think about which pre-experiment covariates here are both available and predictive.
```
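
The CUPED idea in the hint above can be sketched on synthetic data. The covariate here stands in for something like prior-period order count; the data-generating numbers are arbitrary assumptions chosen only to show the variance drop.

```python
import random
from statistics import mean, pvariance

def cuped_adjust(y, x):
    """Return (theta, y_adj) with y_adj = y - theta * (x - mean(x)).

    theta = Cov(x, y) / Var(x), i.e. the slope from regressing y on x.
    """
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)
    theta = cov_xy / pvariance(x)
    return theta, [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

random.seed(0)
# Synthetic pre-period covariate and a correlated in-experiment outcome.
x = [random.gauss(3.0, 1.0) for _ in range(5000)]
y = [0.8 * xi + random.gauss(0.0, 1.0) for xi in x]

theta, y_adj = cuped_adjust(y, x)
print(f"variance before: {pvariance(y):.3f}, after: {pvariance(y_adj):.3f}")
```

The adjustment leaves the mean unchanged and shrinks the variance by roughly a factor of $1 - \rho^2$, where $\rho$ is the correlation between covariate and outcome. The covariate must be measured before assignment, and in practice $\theta$ is usually estimated on data pooled across arms.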
#### What This Part Should Cover
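- A power/MDE estimate for a binary metric using something like $n \approx 16\,p(1-p)/\delta^2$ per arm (the rule of thumb for 80% power at two-sided $\alpha = 0.05$), worked on a concrete baseline (e.g. $p = 0.10$ and absolute $\delta = 0.005$ give $n \approx 16 \times 0.09 / 0.005^2 = 57{,}600$ units per arm).
- The **design effect** for clustered/switchback designs: $n_{\text{eff}} = n / (1 + (m-1)\cdot \text{ICC})$, where $m$ is the average cluster size and ICC is the intra-cluster correlation, and why geo-level tests need more traffic/duration.
- **CUPED**: the adjustment $Y_{\text{adj}} = Y - \theta\,(X - \bar X)$, what $\theta$ is ($\theta = \operatorname{Cov}(X, Y)/\operatorname{Var}(X)$, the slope from regressing $Y$ on $X$), useful covariates (prior 7-day orders, prior sessions, pre-period spend), and the payoff (lower variance → smaller MDE → shorter test).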
- **Stratification**: which slices matter (new vs returning, dense vs sparse markets, supply conditions, platform), and the Simpson's-paradox risk if the traffic mix differs across arms.
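
A short sketch of the power arithmetic, assuming 80% power and a two-sided 5% significance level; the cluster size and ICC used for the design effect are illustrative guesses, not measured values. It uses the two-proportion normal approximation with equal variance in both arms, which is where the "16" rule of thumb comes from (the exact multiplier is about 15.7).

```python
import math
from statistics import NormalDist

def n_per_arm(p, delta, alpha=0.05, power=0.80):
    """Units per arm to detect an absolute lift `delta` on baseline rate `p`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2)

def design_effect(m, icc):
    """Variance inflation for clusters of average size m with intra-cluster correlation icc."""
    return 1 + (m - 1) * icc

n = n_per_arm(p=0.10, delta=0.005)
rule_of_thumb = 16 * 0.10 * 0.90 / 0.005 ** 2   # about 57,600
deff = design_effect(m=50, icc=0.05)            # assumed cluster size and ICC
n_clustered = math.ceil(n * deff)

print(n, round(rule_of_thumb), round(deff, 2), n_clustered)
```

Even a modest ICC multiplies the required sample several times over once clusters are large, which is the quantitative reason switchback and geo tests run longer than user-level tests.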
### Part 7 — Ramping, rollback, and launch decision
Define the criteria for **ramping, rollback, and final launch**. Give the ramp sequence, explicit rollback triggers, and a multi-factor launch decision (not just "is lift positive?").
```hint Sequence
Stage exposure so you catch operational failures before statistical ones: offline validation → shadow → canary ramp → full experiment. Pair each stage with what it's checking for.
```
#### What This Part Should Cover
- A **ramp sequence**: offline replay/backtest with point-in-time features (NDCG, log loss, calibration) → shadow mode → canary ramp (e.g. 1%→5%→25%→50%) → full experiment over complete weekly cycles.
- **Rollback triggers**: latency regression, ETA/cancellation degradation, elevated missingness/timeouts, SRM/logging corruption, severe merchant-exposure skew.
- A **launch decision framework**: primary metric significant AND guardrails intact AND robust across key slices AND gain large enough to justify infra/latency cost AND not dependent on fragile real-time features AND not just novelty.
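
The rollback triggers and the launch rule can be encoded as explicit checks so they are agreed on before the ramp starts. Every threshold and field name below is a hypothetical placeholder; real limits would come from the guardrail definitions in Part 1, and slice robustness and novelty decay would need separate time-series checks not shown here.

```python
from dataclasses import dataclass

@dataclass
class Readout:
    lift_ci_low: float           # lower confidence bound on primary-metric lift
    p99_latency_delta_ms: float  # treatment minus control
    eta_delta_min: float         # change in delivery ETA, minutes
    cancel_rate_delta: float     # change in cancellation rate (absolute)
    feature_timeout_rate: float  # share of treatment requests with a timeout
    srm_p_value: float

# Assumed limits, for illustration only.
LIMITS = {
    "p99_latency_delta_ms": 10.0,
    "eta_delta_min": 0.5,
    "cancel_rate_delta": 0.002,
    "feature_timeout_rate": 0.01,
    "srm_p_value_min": 0.001,
}

def rollback_reasons(r):
    """Return the list of tripped rollback triggers (empty means healthy)."""
    reasons = []
    if r.p99_latency_delta_ms > LIMITS["p99_latency_delta_ms"]:
        reasons.append("latency")
    if r.eta_delta_min > LIMITS["eta_delta_min"]:
        reasons.append("eta")
    if r.cancel_rate_delta > LIMITS["cancel_rate_delta"]:
        reasons.append("cancellations")
    if r.feature_timeout_rate > LIMITS["feature_timeout_rate"]:
        reasons.append("feature_timeouts")
    if r.srm_p_value < LIMITS["srm_p_value_min"]:
        reasons.append("srm")
    return reasons

def should_launch(r, min_practical_lift):
    """Launch only if no trigger fired and the lift clears the practical bar."""
    return not rollback_reasons(r) and r.lift_ci_low >= min_practical_lift

healthy = Readout(0.006, 3.0, 0.1, 0.0005, 0.004, 0.40)
print(rollback_reasons(healthy), should_launch(healthy, min_practical_lift=0.005))
```

Requiring the lower confidence bound to clear the practical-significance bar is a deliberately strict choice; a team could reasonably use the point estimate plus a significance test instead.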
### What a Strong Answer Covers
Across all seven parts, a strong answer should read as one coherent plan rather than seven disconnected checklists. Cross-cutting signals the interviewer is looking for:
- **Internal consistency** — the Part 2 randomization choice, the Part 4 logging, the Part 5 interference handling, and the Part 6 power math reinforce each other (e.g. choosing switchback implies a design-effect penalty and marketplace-level logging).
- **Marketplace/causal sophistication** — treats interference, SUTVA, and two-sided effects as first-class, not an afterthought.
- **Operational safety as a co-equal goal** — guardrails, shadow mode, and staged rollout are integral, not bolt-ons.
- **Quantitative grounding** — concrete metric, baseline, MDE, and a defensible launch bar.
- **Judgment over dogma** — defaults with explicit switch conditions, and a launch decision that weighs lift against complexity, latency, and fragility.
### Follow-up Questions
- Suppose user-level SRM is clean but the geo-time switchback shows a strong day-part interaction (treatment wins at lunch, loses at dinner). How do you decide whether to launch, and to whom?
- V2.0's lift comes mostly from a single real-time feature with a 2% serving timeout rate. How do you quantify how much of the measured lift is the model versus infra quality, and what would you require before launch?
- Online conversion is up 0.4% but offline NDCG is flat. How do you reconcile this, and which do you trust?
- After full launch, the lift decays over three weeks toward zero. How do you distinguish a novelty effect from a genuine regression introduced by ramping, and what experiment would you run to find out?
Quick Answer: This question evaluates expertise in experimentation design and causal inference within two-sided marketplace environments. It tests the ability to handle interference, SUTVA violations, metric selection, and rollout safety when deploying ranking model upgrades — core competencies for data scientist roles focused on product analytics and A/B testing.