PracHub

Evaluate a new ranking model

Last updated: Jun 25, 2026

Quick Overview

This question evaluates expertise in experimentation design and causal inference within two-sided marketplace environments. It tests the ability to handle interference, SUTVA violations, metric selection, and rollout safety when deploying ranking model upgrades — core competencies for data scientist roles focused on product analytics and A/B testing.

  • hard
  • DoorDash
  • Analytics & Experimentation
  • Data Scientist

Evaluate a new ranking model

Company: DoorDash

Role: Data Scientist

Category: Analytics & Experimentation

Difficulty: hard

Interview Round: Onsite

Hints

Part 1 — Primary success metric and guardrail metrics

  • Where to start: Start from business value, not engagement. Ask: what action on the homepage actually creates marketplace value? Then ask what that optimization could quietly break on the supply/operations side.
  • Pitfall to name: Explain why CTR alone is a poor primary metric (noisy, gameable, a model can raise clicks while lowering real orders), and pick a metric closer to value (e.g. orders or GMV per session). Guardrails should cover both the consumer-latency path and the marketplace/operations side (ETA, cancellations, merchant fairness).

Part 2 — Unit of randomization

  • Key tension: This is fundamentally a bias-variance / SUTVA tradeoff. Finer units (user/session) give power but can violate the assumption that one unit's treatment doesn't affect another's outcome; coarser units (geo, switchback) contain interference but cost power.
  • Technique to surface: Name switchback / geo-time clustered designs as the interference-robust option used in delivery/ride-sharing, and tie the choice to how much V2.0 actually moves marketplace allocation rather than picking one dogmatically.

Part 3 — Serving infrastructure for experiment-specific versions and feature configs

  • Decompose the system: Separate three concerns: (1) deterministic assignment, (2) a config/registry that maps an arm to {model version, feature bundle}, and (3) safe handling when a treatment-only feature is missing at serve time.
  • Safety mechanism: Before any live exposure, how do you de-risk latency and feature failures without affecting users? Think about computing V2.0 outputs without serving them.

Part 4 — Logging requirements

  • What to anchor on: The test is: could you exactly reconstruct what each user saw and why? Log enough to attribute outcomes to an arm, a model version, and a specific candidate list, plus the diagnostics that explain validity threats later.

Part 5 — Practical validity threats

  • Triage order: Check experiment integrity before reading any lift: assignment/logging health first. Then handle timing (delayed conversions), then biases that conflate infra quality or eligibility shifts with model quality, then interference.
  • The subtle ones: For feature missingness, ask whether you're measuring the model or the infrastructure (if treatment has more missing real-time features). For selection bias, remember the eligible store set shifts by time/place. For interference, connect back to your Part 2 randomization choice.

Part 6 — Power / MDE and variance reduction

  • Formula to reach for: For a binary metric, relate sample size to baseline rate $p$ and absolute detectable lift $\delta$ via the standard two-proportion sample-size approximation, then adjust for clustered designs with a design effect based on cluster size and intra-cluster correlation.
  • Variance reduction: CUPED uses a pre-period covariate correlated with the outcome (e.g. prior order count) to subtract predictable variance. Think about which pre-experiment covariates here are both available and predictive.

Part 7 — Ramping, rollback, and launch decision

  • Sequence: Stage exposure so you catch operational failures before statistical ones: offline validation → shadow → canary ramp → full experiment. Pair each stage with what it's checking for.

Related Interview Questions

  • Evaluate Biker Feature Success - DoorDash (hard)
  • How would you test product changes? - DoorDash (hard)
  • How to test bike delivery? - DoorDash (medium)
  • Investigate LA successful orders drop - DoorDash (easy)
  • How would you diagnose a completed orders drop? - DoorDash (easy)
Evaluate a new ranking model

DoorDash
Feb 6, 2026, 12:00 AM

A food-delivery company serves homepage store recommendations with ranking model V1.1. A new model V2.0 adds several new features and may require a different feature-set configuration for treatment users.

Design an experimentation and rollout plan for this model upgrade. This is a two-sided marketplace: changing what the homepage shows can shift consumer demand, merchant exposure, courier utilization, and delivery ETAs — so the plan must combine product metric design, causal inference (interference / SUTVA), and operational safety, not just an A/B test on clicks.

The question is broken into seven parts. Treat them as one coherent plan: the metric, randomization, infrastructure, logging, validity threats, statistics, and launch criteria should all hang together.

Constraints & Assumptions

  • Two-sided marketplace: recommendations affect merchant demand, courier load, and delivery times, so one user's treatment can affect another user's experience (interference).
  • The eligible candidate pool is constrained: a store must be in delivery range and currently open to be shown — and that eligible set changes by time and location.
  • V2.0 may depend on additional features, including possibly real-time features, that V1.1 did not use. Treatment must be able to fetch a different feature bundle than control.
  • Homepage serving is latency-sensitive (low single-/double-digit ms budgets per retrieval path), so any added feature computation has a latency cost.
  • Assume meaningful but finite traffic — variance reduction and power planning matter; you cannot run forever.

Clarifying Questions to Ask

  • What is the company's true north — short-term orders/GMV, contribution margin, or long-term retention? This determines the primary metric.
  • How large a lift do we need V2.0 to deliver to justify the added infra complexity and any latency cost (i.e., what is the practically significant effect)?
  • How material is interference expected to be — does V2.0 mostly re-rank the same eligible stores, or does it change which stores get demand enough to move ETAs and supply?
  • What feature SLAs and freshness guarantees exist, and what is the current feature-missingness/timeout rate at serving time?
  • What is the baseline homepage-session→order conversion rate and current daily homepage traffic (needed for power/MDE)?
  • Are there existing experimentation primitives — a bucketing service, config/feature-flag system, switchback tooling — we must build on or around?

Part 1 — Primary success metric and guardrail metrics

Define the primary success metric and the important guardrail metrics for a homepage recommendation model in a two-sided delivery marketplace. Justify the primary metric over naive alternatives, and explain why guardrails are non-negotiable here.

What This Part Should Cover

  • A single, business-aligned primary metric (e.g. orders or GMV/contribution-margin per session) with an explicit argument for why it beats CTR.
  • Guardrail metrics spanning serving health (p95/p99 latency, timeout/error rate) AND marketplace health (delivery ETA, cancellation/refund rate, merchant-exposure concentration/fairness).
  • Recognition that maximizing immediate orders can degrade ETAs, courier load balancing, merchant fairness, and long-term supply diversity.
  • A short layer of secondary/diagnostic metrics (CTR, add-to-cart, reorder, basket/AOV, new-store discovery, retention) used to interpret, not decide.
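As a rough sketch of how the primary metric and one marketplace guardrail could be computed per arm: the session record shape (`ordered`, `stores_shown`), the function names, and the toy data are all assumptions made for illustration, and the Herfindahl-Hirschman index is only one possible way to measure merchant-exposure concentration.

```python
from collections import Counter

def orders_per_session(sessions):
    """Primary metric: share of homepage sessions that end in an order."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if s["ordered"]) / len(sessions)

def exposure_hhi(sessions):
    """Guardrail: Herfindahl-Hirschman index of store impressions.

    Ranges from 1/k (impressions spread evenly over k stores) to 1.0
    (all impressions on one store). Higher means more concentrated.
    """
    counts = Counter(store for s in sessions for store in s["stores_shown"])
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) ** 2 for c in counts.values())

# Toy data, made up for illustration only.
control = [
    {"ordered": True, "stores_shown": ["a", "b"]},
    {"ordered": False, "stores_shown": ["c", "d"]},
]
treatment = [
    {"ordered": True, "stores_shown": ["a", "a"]},
    {"ordered": True, "stores_shown": ["a", "b"]},
]

for name, arm in (("control", control), ("treatment", treatment)):
    print(name, orders_per_session(arm), exposure_hhi(arm))
```

In this toy data the treatment arm converts more sessions but also concentrates impressions on store `a`, which is the kind of trade-off the guardrail is meant to surface.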

Part 2 — Unit of randomization

Choose the unit of randomization — user-level, session-level, geo-level, or switchback/time-based — given that recommendations can affect merchant demand, delivery times, and marketplace balance. State your default and the condition under which you'd switch.

What This Part Should Cover

  • Explicit reasoning about SUTVA / interference: why user-level A/B can be biased when treatment changes which stores get demand.
  • A comparison of the options with honest pros/cons (power vs. interference-containment; session-level contamination across variants).
  • A decision rule: user-level when spillovers are small (modest re-rank of eligible stores); geo-time switchback or zone-clustered when the ranker materially shifts marketplace dynamics.
  • Awareness of the power cost of clustering (fewer independent units, geo heterogeneity).
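One way the switchback option might be implemented is to hash each (zone, time-window) cell to an arm. The sketch below assumes a 30-minute window, a hypothetical `zone_id`, and a hypothetical experiment name; production switchback designs often add balancing constraints on the schedule, which this simplification leaves out.

```python
import hashlib
from datetime import datetime, timedelta, timezone

def switchback_arm(zone_id, ts, experiment_id, window_minutes=30):
    """Assign a (zone, time-window) cell to an arm deterministically.

    `ts` should be timezone-aware so the window index does not depend
    on the server's local time zone.
    """
    window_index = int(ts.timestamp()) // (window_minutes * 60)
    key = f"{experiment_id}:{zone_id}:{window_index}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as an integer and take its parity.
    return "treatment" if int.from_bytes(digest[:8], "big") % 2 == 0 else "control"

# Arms for four consecutive 30-minute windows in one hypothetical zone.
start = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
schedule = [
    switchback_arm("zone-7", start + timedelta(minutes=30 * i), "ranker-v2")
    for i in range(4)
]
print(schedule)
```

Every request in the same zone and window gets the same arm, so demand shifts caused by the ranker stay inside the cell that produced them. The analysis unit is then the cell, not the user.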

Part 3 — Serving infrastructure for experiment-specific versions and feature configs

Describe how the serving infrastructure should support experiment-specific model versions and feature-set configuration, so control and treatment can safely fetch different feature lists. Show how you keep this safe and reproducible.

What This Part Should Cover

  • Deterministic bucketing (stable hash of user_id or geo-time bucket) that logs experiment ID and arm — no per-request flapping.
  • An experiment config service mapping arm → model version + feature bundle, so features are not hard-coded in the app (control: v1.1/bundle A; treatment: v2.0/bundle B).
  • A versioned feature registry (schema, types, freshness SLA, defaults, owners, optional/deprecated flags) and backward-compatible serving (safe defaults + a missingness indicator, never hard serving failures).
  • Shadow mode before live: compute V2.0 scores in parallel to compare latency, score distribution, missingness, and calibration.
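The first three bullets can be sketched together: a stable hash for assignment, a config map from arm to model version and feature bundle, and safe defaults with a missingness flag. The feature names, default values, and the in-memory dictionaries standing in for a config service and feature registry are all hypothetical.

```python
import hashlib

# Hypothetical arm -> serving config; a real system would read this
# from an experiment config service, not a module-level dict.
EXPERIMENT_CONFIG = {
    "control": {
        "model_version": "v1.1",
        "features": ["distance_km", "store_rating"],
    },
    "treatment": {
        "model_version": "v2.0",
        "features": ["distance_km", "store_rating", "live_prep_minutes"],
    },
}

# Registry defaults used when a feature is missing or times out (made-up values).
FEATURE_DEFAULTS = {"distance_km": 5.0, "store_rating": 4.0, "live_prep_minutes": 15.0}

def assign_arm(user_id, experiment_id, treatment_share=0.5):
    """Stable hash of (experiment, user) -> arm; same input, same arm."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

def build_features(arm, fetched):
    """Return the arm's feature values plus a per-feature missingness flag."""
    values, missing = {}, {}
    for name in EXPERIMENT_CONFIG[arm]["features"]:
        raw = fetched.get(name)
        missing[name] = raw is None
        values[name] = FEATURE_DEFAULTS[name] if raw is None else raw
    return values, missing

arm = assign_arm("user-123", "ranker-v2")
# Simulate a serve-time fetch where the real-time feature timed out.
values, missing = build_features("treatment", {"distance_km": 1.2, "store_rating": 4.7})
print(arm, EXPERIMENT_CONFIG[arm]["model_version"], values, missing)
```

Salting the hash with the experiment ID keeps assignments independent across experiments, and returning the missingness flags alongside the values lets the model and the logs both see when a default was used.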

Part 4 — Logging requirements

Specify what events and metadata must be logged so the experiment can be analyzed correctly and reproducibly.

What This Part Should Cover

  • Assignment-level fields: experiment ID, arm, unit ID (user/session/geo-time bucket), timestamp + timezone, model version, feature-config version.
  • Request/ranking-level fields: candidate set before ranking, ranked list shown, per-candidate scores where feasible, feature-missingness/freshness indicators, per-component latency.
  • Marketplace/ranking diagnostics: eligible/serviceable pool size, fallback-triggered flag, stale-feature usage, store-level exposure.
  • Downstream outcomes with attribution window: click, add-to-cart, order, basket size, cancellation.
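A minimal sketch of a per-request log record covering the assignment and ranking fields above. The class name, field names, and validation rules are assumptions for illustration; downstream outcomes (click, order, cancellation) would normally be separate events joined on the unit ID and timestamp.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class RankingLogRecord:
    experiment_id: str
    arm: str
    unit_id: str                  # user, session, or geo-time bucket
    timestamp_utc: str            # ISO 8601, always UTC
    model_version: str
    feature_config_version: str
    candidate_store_ids: list     # eligible pool before ranking
    ranked_store_ids: list        # what the user was actually shown
    scores: dict                  # store_id -> model score
    missing_features: list = field(default_factory=list)
    fallback_triggered: bool = False
    latency_ms: dict = field(default_factory=dict)  # per-component latency

    def validate(self):
        """Reject records that could not be replayed faithfully later."""
        if self.arm not in ("control", "treatment"):
            raise ValueError(f"unknown arm: {self.arm}")
        if not set(self.ranked_store_ids) <= set(self.candidate_store_ids):
            raise ValueError("ranked list contains stores outside the candidate set")

    def to_json(self):
        self.validate()
        return json.dumps(asdict(self), sort_keys=True)

record = RankingLogRecord(
    experiment_id="ranker-v2",
    arm="treatment",
    unit_id="user-123",
    timestamp_utc="2026-01-01T12:00:00Z",
    model_version="v2.0",
    feature_config_version="bundle-B",
    candidate_store_ids=["s1", "s2", "s3"],
    ranked_store_ids=["s2", "s1"],
    scores={"s1": 0.41, "s2": 0.77},
    missing_features=["live_prep_minutes"],
    latency_ms={"feature_fetch": 7.5, "inference": 3.1},
)
print(record.to_json())
```

Logging the candidate set as well as the shown list is what makes it possible to tell later whether a difference came from the ranker or from a change in the eligible pool.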

Part 5 — Practical validity threats

Explain how to handle sample ratio mismatch, delayed conversions, feature missingness, novelty effects, selection bias, and spillover/interference. For each, give a concrete diagnostic or mitigation.

What This Part Should Cover

  • SRM: define it (planned vs observed split), what it signals (assignment/logging bugs, treatment-induced crashes, geo routing), and that you do not trust lift until it's resolved.
  • Delayed conversions: an attribution window and analysis window; why reading too early biases toward click-heavy variants.
  • Feature missingness: missingness indicators, freshness logging, and slicing by missing vs non-missing so you don't confuse infra quality with model quality.
  • Novelty effects, selection bias, spillover/interference: monitor for novelty decay over time; log eligible-pool composition so eligibility shifts aren't read as lift; mitigate interference with switchback/zone clustering and marketplace-level outcome monitoring.
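The SRM check in the first bullet is typically a chi-square goodness-of-fit test on the observed unit counts. The sketch below uses only the standard library; the function name, the counts, and the 0.001 alert threshold are assumptions, so substitute whatever threshold your experimentation platform uses.

```python
import math

def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value for the observed arm split."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # With two arms there is 1 degree of freedom, and the chi-square
    # survival function reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

SRM_ALERT_THRESHOLD = 0.001  # assumed threshold, not a universal standard

for counts in [(50_000, 50_100), (50_000, 51_000), (50_000, 52_000)]:
    p = srm_p_value(*counts)
    print(counts, f"p={p:.4g}", "ALERT" if p < SRM_ALERT_THRESHOLD else "ok")
```

A low p-value says the split is unlikely under correct assignment. It does not say why, so the follow-up is to slice the counts by platform, zone, and day to locate the bug.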

Part 6 — Power / MDE and variance reduction

Explain how to estimate power / MDE, and when stratification or CUPED would help. Show the quantitative reasoning, including how clustering changes the math.

What This Part Should Cover

  • A power/MDE estimate for a binary metric using something like $n \approx 16\,p(1-p)/\delta^2$ per arm, worked on a concrete baseline (e.g. $p = 0.10$, $\delta = 0.005$).
  • The design effect for clustered/switchback designs: $n_{\text{eff}} = n / (1 + (m-1)\cdot \text{ICC})$, and why geo-level tests need more traffic/duration.
  • CUPED: the adjustment $Y_{\text{adj}} = Y - \theta\,(X - \bar X)$, what $\theta$ is, useful covariates (prior 7-day orders, prior sessions, pre-period spend), and the payoff (lower variance → smaller MDE → shorter test).
  • Stratification: which slices matter (new vs returning, dense vs sparse markets, supply conditions, platform), and the Simpson's-paradox risk if the traffic mix differs across arms.
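The three formulas above can be worked through numerically. The sketch below uses the baseline from the first bullet ($p = 0.10$, $\delta = 0.005$); the cluster size of 50 and ICC of 0.02 are made-up inputs, as is the toy CUPED data. The factor 16 is the usual rule of thumb for roughly 80% power at a two-sided 5% significance level, and $\theta$ is taken as $\mathrm{Cov}(Y, X)/\mathrm{Var}(X)$.

```python
import math
from statistics import mean, pvariance

def _ceil(x):
    """Ceiling that ignores floating-point noise below 1e-6."""
    return math.ceil(round(x, 6))

def sample_size_per_arm(p, delta):
    """Rule-of-thumb n per arm: 16 * p * (1 - p) / delta^2."""
    return _ceil(16 * p * (1 - p) / delta ** 2)

def design_effect(cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

def cuped_adjust(y, x):
    """Return Y - theta * (X - mean(X)) with theta = Cov(Y, X) / Var(X)."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var = sum((xi - x_bar) ** 2 for xi in x)
    theta = cov / var
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

n = sample_size_per_arm(0.10, 0.005)              # sessions per arm, user-level
deff = design_effect(cluster_size=50, icc=0.02)   # assumed cluster size and ICC
n_clustered = _ceil(n * deff)                     # sessions per arm if clustered

# Toy CUPED example: x = pre-period orders, y = in-experiment orders.
x = [0, 1, 2, 3, 5, 8]
y = [0, 1, 1, 3, 4, 7]
y_adj = cuped_adjust(y, x)

print(n, deff, n_clustered)
print(pvariance(y), pvariance(y_adj))
```

With these inputs the user-level design needs 57,600 sessions per arm and the clustered design about twice that, which is the power cost the design-effect bullet refers to. The CUPED adjustment leaves the mean unchanged and shrinks the variance by a factor of $1 - \rho^2$, where $\rho$ is the correlation between $X$ and $Y$. In practice $\theta$ is usually estimated on data pooled across both arms.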

Part 7 — Ramping, rollback, and launch decision

Define the criteria for ramping, rollback, and final launch. Give the ramp sequence, explicit rollback triggers, and a multi-factor launch decision (not just "is lift positive?").

What This Part Should Cover

  • A ramp sequence: offline replay/backtest with point-in-time features (NDCG, log loss, calibration) → shadow mode → canary ramp (e.g. 1%→5%→25%→50%) → full experiment over complete weekly cycles.
  • Rollback triggers: latency regression, ETA/cancellation degradation, elevated missingness/timeouts, SRM/logging corruption, severe merchant-exposure skew.
  • A launch decision framework: primary metric significant AND guardrails intact AND robust across key slices AND gain large enough to justify infra/latency cost AND not dependent on fragile real-time features AND not just novelty.
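The launch decision framework can be written down as an explicit rule so that the bar is agreed before the readout. This is only a sketch: the `Readout` fields, the way each condition is operationalized (for example, "not just novelty" as a positive lift in the final week), and the four outcome labels are assumptions, not a prescribed policy.

```python
from dataclasses import dataclass

@dataclass
class Readout:
    srm_ok: bool                      # sample ratio check passed
    guardrails_ok: bool               # latency, ETA, cancellations, exposure in bounds
    lift: float                       # point estimate of primary-metric lift
    ci_low: float                     # lower bound of the lift confidence interval
    min_practical_lift: float         # smallest lift worth the infra/latency cost
    worst_slice_ci_high: float        # upper CI bound of lift in the weakest key slice
    lift_without_realtime_ok: bool    # lift survives when real-time features are missing
    final_week_lift: float            # lift in the last full week (novelty check)

def launch_decision(r: Readout) -> str:
    if not r.srm_ok:
        return "invalid"    # do not read lift until assignment is trusted
    if not r.guardrails_ok:
        return "rollback"
    checks = [
        r.ci_low > 0,                       # primary metric significant
        r.lift >= r.min_practical_lift,     # gain justifies the added cost
        r.worst_slice_ci_high >= 0,         # no key slice is clearly harmed
        r.lift_without_realtime_ok,         # not dependent on fragile features
        r.final_week_lift > 0,              # not just novelty
    ]
    return "launch" if all(checks) else "hold"

readout = Readout(
    srm_ok=True, guardrails_ok=True, lift=0.006, ci_low=0.002,
    min_practical_lift=0.005, worst_slice_ci_high=0.004,
    lift_without_realtime_ok=True, final_week_lift=0.005,
)
print(launch_decision(readout))
```

Integrity and guardrail failures short-circuit before any lift is read, which mirrors the triage order of checking experiment health first.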

What a Strong Answer Covers

Across all seven parts, a strong answer should read as one coherent plan rather than seven disconnected checklists. Cross-cutting signals the interviewer is looking for:

  • Internal consistency — the Part 2 randomization choice, the Part 4 logging, the Part 5 interference handling, and the Part 6 power math reinforce each other (e.g. choosing switchback implies a design-effect penalty and marketplace-level logging).
  • Marketplace/causal sophistication — treats interference, SUTVA, and two-sided effects as first-class, not an afterthought.
  • Operational safety as a co-equal goal — guardrails, shadow mode, and staged rollout are integral, not bolt-ons.
  • Quantitative grounding — concrete metric, baseline, MDE, and a defensible launch bar.
  • Judgment over dogma — defaults with explicit switch conditions, and a launch decision that weighs lift against complexity, latency, and fragility.

Follow-up Questions

  • Suppose user-level SRM is clean but the geo-time switchback shows a strong day-part interaction (treatment wins at lunch, loses at dinner). How do you decide whether to launch, and to whom?
  • V2.0's lift comes mostly from a single real-time feature with a 2% serving timeout rate. How do you quantify how much of the measured lift is the model versus infra quality, and what would you require before launch?
  • Online conversion is up +0.4% but offline NDCG was flat. How do you reconcile this, and which do you trust?
  • After full launch, the lift decays over three weeks toward zero. How do you distinguish a novelty effect from a genuine regression introduced by ramping, and what experiment would you run to find out?