What difficulty level is this interview question?

This is a hard difficulty Analytics & Experimentation question, commonly asked during Onsite rounds at DoorDash.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at DoorDash during technical interviews.

Evaluate a new ranking model | DoorDash Interview Question

Q: Evaluate a new ranking model

This question evaluates expertise in experimentation design and causal inference within two-sided marketplace environments. It tests the ability to handle interference, SUTVA violations, metric selection, and rollout safety when deploying ranking model upgrades — core competencies for data scientist roles focused on product analytics and A/B testing.

A food-delivery company serves homepage store recommendations with ranking model V1.1. A new model V2.0 adds several new features and may require a different feature-set configuration for treatment users.

Design an experimentation and rollout plan for this model upgrade. This is a two-sided marketplace: changing what the homepage shows can shift consumer demand, merchant exposure, courier utilization, and delivery ETAs — so the plan must combine product metric design, causal inference (interference / SUTVA), and operational safety, not just an A/B test on clicks.

The question is broken into seven parts. Treat them as one coherent plan: the metric, randomization, infrastructure, logging, validity threats, statistics, and launch criteria should all hang together.

Constraints & Assumptions

Two-sided marketplace: recommendations affect merchant demand, courier load, and delivery times, so one user's treatment can affect another user's experience (interference).
The eligible candidate pool is constrained: a store must be in delivery range and currently open to be shown — and that eligible set changes by time and location.
V2.0 may depend on additional features, including possibly real-time features, that V1.1 did not use. Treatment must be able to fetch a different feature bundle than control.
Homepage serving is latency-sensitive (each retrieval path has a budget in the single- to low double-digit millisecond range), so any added feature computation has a latency cost.
Assume meaningful but finite traffic — variance reduction and power planning matter; you cannot run forever.

Clarifying Questions to Ask

What is the company's true north — short-term orders/GMV, contribution margin, or long-term retention? This determines the primary metric.
How large a lift do we need V2.0 to deliver to justify the added infra complexity and any latency cost (i.e., what is the practically significant effect)?
How material is interference expected to be — does V2.0 mostly re-rank the same eligible stores, or does it change which stores get demand enough to move ETAs and supply?
What feature SLAs and freshness guarantees exist, and what is the current feature-missingness/timeout rate at serving time?
What is the baseline homepage-session→order conversion rate and current daily homepage traffic (needed for power/MDE)?
Are there existing experimentation primitives — a bucketing service, config/feature-flag system, switchback tooling — we must build on or around?

Part 1 — Primary success metric and guardrail metrics

Define the primary success metric and the important guardrail metrics for a homepage recommendation model in a two-sided delivery marketplace. Justify the primary metric over naive alternatives, and explain why guardrails are non-negotiable here.

What This Part Should Cover

A single, business-aligned primary metric (e.g. orders or GMV/contribution-margin per session) with an explicit argument for why it beats CTR.
Guardrail metrics spanning serving health (p95/p99 latency, timeout/error rate) AND marketplace health (delivery ETA, cancellation/refund rate, merchant-exposure concentration/fairness).
Recognition that maximizing immediate orders can degrade ETAs, courier load balancing, merchant fairness, and long-term supply diversity.
A short layer of secondary/diagnostic metrics (CTR, add-to-cart, reorder, basket/AOV, new-store discovery, retention) used to interpret, not decide.
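
The merchant-exposure concentration guardrail listed above can be made concrete in several ways. One option is a Herfindahl-Hirschman index over impression shares. The sketch below assumes impressions are available as per-store counts; the function name and the sample counts are hypothetical and only illustrate the calculation.

```python
from typing import Mapping


def exposure_hhi(impressions_by_store: Mapping[str, int]) -> float:
    """Herfindahl-Hirschman index of impression shares.

    Equals 1/N when exposure is spread evenly over N stores and 1.0 when a
    single store receives every impression.
    """
    total = sum(impressions_by_store.values())
    if total <= 0:
        raise ValueError("no impressions to summarize")
    return sum((count / total) ** 2 for count in impressions_by_store.values())


# Hypothetical impression counts per arm over the same eligible pool.
control = {"store_a": 250, "store_b": 250, "store_c": 250, "store_d": 250}
treatment = {"store_a": 700, "store_b": 100, "store_c": 100, "store_d": 100}

hhi_control = exposure_hhi(control)      # evenly spread exposure
hhi_treatment = exposure_hhi(treatment)  # exposure concentrated on one store
```

A guardrail would then compare the two arms and flag the treatment if concentration rises beyond an agreed tolerance. The tolerance itself is a business decision that the source does not specify.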

Part 2 — Unit of randomization

Choose the unit of randomization — user-level, session-level, geo-level, or switchback/time-based — given that recommendations can affect merchant demand, delivery times, and marketplace balance. State your default and the condition under which you'd switch.

What This Part Should Cover

Explicit reasoning about SUTVA / interference: why user-level A/B can be biased when treatment changes which stores get demand.
A comparison of the options with honest pros/cons (power vs. interference-containment; session-level contamination across variants).
A decision rule: user-level when spillovers are small (modest re-rank of eligible stores); geo-time switchback or zone-clustered when the ranker materially shifts marketplace dynamics.
Awareness of the power cost of clustering (fewer independent units, geo heterogeneity).
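
For the geo-time switchback option, the key property is that every request in the same zone and time window sees the same arm. A minimal sketch, assuming zones are identified by a string ID and windows are fixed-length; the function name, salt, and 60-minute window are hypothetical choices, and a production design might instead use a pre-generated balanced schedule:

```python
import hashlib
from datetime import datetime, timezone


def switchback_arm(zone_id: str, ts: datetime, window_minutes: int = 60,
                   salt: str = "ranker_v2_switchback") -> str:
    """Assign a whole (zone, time window) cell to a single arm."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Index of the fixed-length window that contains this timestamp.
    window_index = int(ts.timestamp()) // (window_minutes * 60)
    key = f"{salt}:{zone_id}:{window_index}".encode()
    digest = hashlib.sha256(key).digest()
    return "treatment" if digest[0] % 2 == 0 else "control"


# Two requests in the same zone and the same hour land in the same arm.
arm_early = switchback_arm("zone_7", datetime(2024, 5, 1, 12, 5, tzinfo=timezone.utc))
arm_late = switchback_arm("zone_7", datetime(2024, 5, 1, 12, 55, tzinfo=timezone.utc))
```

Because the unit of analysis becomes the zone-window cell rather than the user, the number of independent units falls sharply, which is the power cost noted in the list above.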

Part 3 — Serving infrastructure for experiment-specific versions and feature configs

Describe how the serving infrastructure should support experiment-specific model versions and feature-set configuration, so control and treatment can safely fetch different feature lists. Show how you keep this safe and reproducible.

What This Part Should Cover

Deterministic bucketing (stable hash of user_id or geo-time bucket) that logs experiment ID and arm — no per-request flapping.
An experiment config service mapping arm → model version + feature bundle, so features are not hard-coded in the app (control: v1.1/bundle A; treatment: v2.0/bundle B).
A versioned feature registry (schema, types, freshness SLA, defaults, owners, optional/deprecated flags) and backward-compatible serving (safe defaults + a missingness indicator, never hard serving failures).
Shadow mode before live: compute V2.0 scores in parallel to compare latency, score distribution, missingness, and calibration.
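
The first three items above can be sketched together: a stable hash for assignment, an arm-to-config mapping, and a feature resolver that falls back to safe defaults while recording what was missing. Everything named here (the experiment ID, feature names, defaults, and function names) is hypothetical; in practice the mapping would be served by the experiment config service rather than defined in code.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ArmConfig:
    model_version: str
    feature_bundle: Tuple[str, ...]


# Hypothetical experiment config (control: v1.1/bundle A; treatment: v2.0/bundle B).
EXPERIMENT_ID = "homepage_ranker_v2"
ARM_CONFIGS = {
    "control": ArmConfig("v1.1", ("store_popularity", "user_cuisine_affinity")),
    "treatment": ArmConfig(
        "v2.0", ("store_popularity", "user_cuisine_affinity", "realtime_store_load")
    ),
}
# Hypothetical safe defaults, as a versioned feature registry would supply.
FEATURE_DEFAULTS = {
    "store_popularity": 0.0,
    "user_cuisine_affinity": 0.0,
    "realtime_store_load": 0.5,
}


def assign_arm(user_id: str, experiment_id: str = EXPERIMENT_ID,
               treatment_share: float = 0.5) -> str:
    """Deterministic bucketing: the same user always gets the same arm."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def resolve_features(arm: str, fetched: Dict[str, Optional[float]]
                     ) -> Tuple[Dict[str, float], List[str]]:
    """Return the arm's feature values plus the names that fell back to defaults."""
    values: Dict[str, float] = {}
    missing: List[str] = []
    for name in ARM_CONFIGS[arm].feature_bundle:
        value = fetched.get(name)
        if value is None:
            values[name] = FEATURE_DEFAULTS[name]  # safe default, never a hard failure
            missing.append(name)
        else:
            values[name] = value
    return values, missing


# The real-time feature timed out, so treatment serves its default and flags it.
values, missing = resolve_features(
    "treatment", {"store_popularity": 0.8, "user_cuisine_affinity": 0.3}
)
```

Salting the hash with the experiment ID keeps assignments independent across experiments, and returning the missing-feature list lets the serving path log it with the request.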

Part 4 — Logging requirements

Specify what events and metadata must be logged so the experiment can be analyzed correctly and reproducibly.

What This Part Should Cover

Assignment-level fields: experiment ID, arm, unit ID (user/session/geo-time bucket), timestamp + timezone, model version, feature-config version.
Request/ranking-level fields: candidate set before ranking, ranked list shown, per-candidate scores where feasible, feature-missingness/freshness indicators, per-component latency.
Marketplace/ranking diagnostics: eligible/serviceable pool size, fallback-triggered flag, stale-feature usage, store-level exposure.
Downstream outcomes with attribution window: click, add-to-cart, order, basket size, cancellation.
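
One way to keep these fields consistent is a single typed exposure record that is validated before it is written. The sketch below covers the assignment, ranking, and diagnostic fields from the list above; the class name, field names, and sample values are hypothetical, and downstream outcomes are assumed to be joined later on the unit ID and timestamp.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class RankingExposureLog:
    # Assignment-level fields
    experiment_id: str
    arm: str
    unit_type: str               # "user", "session", or "geo_time_bucket"
    unit_id: str
    ts_utc: str                  # ISO-8601 timestamp in UTC
    local_timezone: str
    model_version: str
    feature_config_version: str
    # Request/ranking-level fields
    candidate_store_ids: List[str]
    ranked_store_ids: List[str]
    scores: Dict[str, float]
    missing_features: List[str]
    latency_ms: Dict[str, float]
    # Marketplace/ranking diagnostics
    eligible_pool_size: int
    fallback_triggered: bool

    def validate(self) -> None:
        if self.arm not in ("control", "treatment"):
            raise ValueError(f"unknown arm: {self.arm}")
        if not set(self.ranked_store_ids) <= set(self.candidate_store_ids):
            raise ValueError("ranked list contains stores outside the candidate set")

    def to_json(self) -> str:
        self.validate()
        return json.dumps(asdict(self), sort_keys=True)


record = RankingExposureLog(
    experiment_id="homepage_ranker_v2", arm="treatment",
    unit_type="user", unit_id="user_123",
    ts_utc="2024-05-01T12:05:00+00:00", local_timezone="America/Los_Angeles",
    model_version="v2.0", feature_config_version="bundle_b_001",
    candidate_store_ids=["s1", "s2", "s3"], ranked_store_ids=["s2", "s1"],
    scores={"s1": 0.41, "s2": 0.77, "s3": 0.12},
    missing_features=["realtime_store_load"],
    latency_ms={"feature_fetch": 6.2, "scoring": 3.1},
    eligible_pool_size=3, fallback_triggered=False,
)
payload = record.to_json()
```

Logging the candidate set alongside the ranked list is what later allows eligibility shifts to be separated from ranking effects.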

Part 5 — Practical validity threats

Explain how to handle sample ratio mismatch, delayed conversions, feature missingness, novelty effects, selection bias, and spillover/interference. For each, give a concrete diagnostic or mitigation.

What This Part Should Cover

SRM: define it (planned vs observed split), what it signals (assignment/logging bugs, treatment-induced crashes, geo routing), and that you do not trust lift until it's resolved.
Delayed conversions: an attribution window and analysis window; why reading too early biases toward click-heavy variants.
Feature missingness: missingness indicators, freshness logging, and slicing by missing vs non-missing so you don't confuse infra quality with model quality.
Novelty effects, selection bias, spillover/interference: monitor for novelty decay over time; log eligible-pool composition so eligibility shifts aren't read as lift; mitigate interference with switchback/zone clustering and marketplace-level outcome monitoring.
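
The SRM diagnostic is commonly run as a chi-square goodness-of-fit test on the observed arm counts. A minimal sketch for two arms using only the standard library; the function name, the counts, and the 0.001 alert threshold are illustrative assumptions rather than values from the source:

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


SRM_ALERT_THRESHOLD = 0.001  # assumed alerting threshold

p_balanced = srm_p_value(50_000, 50_100)  # small imbalance, consistent with chance
p_skewed = srm_p_value(50_000, 51_500)    # larger imbalance, likely an assignment or logging bug
```

If the p-value falls below the threshold, the lift estimate should be set aside until the cause of the imbalance is found.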

Part 6 — Power / MDE and variance reduction

Explain how to estimate power / MDE, and when stratification or CUPED would help. Show the quantitative reasoning, including how clustering changes the math.

What This Part Should Cover

A power/MDE estimate for a binary metric using something like $n \approx 16\,p(1-p)/\delta^2$ per arm (the rule of thumb for roughly 80% power at two-sided $\alpha = 0.05$), worked on a concrete baseline (e.g. $p = 0.10$, absolute $\delta = 0.005$, which gives $n \approx 16 \cdot 0.09 / 0.000025 = 57{,}600$ units per arm).
The design effect for clustered/switchback designs: $n_{\text{eff}} = n / (1 + (m-1)\cdot \text{ICC})$, where $m$ is the average cluster size and ICC is the intra-cluster correlation, and why geo-level tests need more traffic/duration.
CUPED: the adjustment $Y_{\text{adj}} = Y - \theta\,(X - \bar X)$, what $\theta$ is (the coefficient $\operatorname{Cov}(X, Y)/\operatorname{Var}(X)$ on a pre-experiment covariate $X$), useful covariates (prior 7-day orders, prior sessions, pre-period spend), and the payoff (lower variance → smaller MDE → shorter test).
Stratification: which slices matter (new vs returning, dense vs sparse markets, supply conditions, platform), and the Simpson's-paradox risk if the traffic mix differs across arms.
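
The three formulas above can be checked numerically. The sketch below computes the rule-of-thumb sample size for the baseline given in the text, inflates it by a design effect, and applies the CUPED adjustment to synthetic data. The cluster size of 50, the ICC of 0.02, and the simulated covariate are hypothetical values chosen only to make the arithmetic visible.

```python
import random
import statistics
from typing import List, Sequence


def n_per_arm(p: float, delta: float) -> float:
    """Rule-of-thumb sample size per arm: 16 * p * (1 - p) / delta^2."""
    return 16 * p * (1 - p) / delta ** 2


def design_effect(m: float, icc: float) -> float:
    """Variance inflation for clusters of average size m: 1 + (m - 1) * ICC."""
    return 1 + (m - 1) * icc


def cuped_adjust(y: Sequence[float], x: Sequence[float]) -> List[float]:
    """Y_adj = Y - theta * (X - mean(X)) with theta = Cov(X, Y) / Var(X)."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / statistics.variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Baseline from the text: p = 0.10, absolute delta = 0.005.
n_independent = n_per_arm(0.10, 0.005)                   # about 57,600 per arm
n_clustered = n_independent * design_effect(50, 0.02)    # about 114,048 per arm

# Synthetic data: x is a pre-period covariate that is correlated with outcome y.
random.seed(0)
x = [random.gauss(5.0, 2.0) for _ in range(2000)]
y = [0.5 * xi + random.gauss(0.0, 1.0) for xi in x]
y_adj = cuped_adjust(y, x)

var_raw = statistics.variance(y)
var_adj = statistics.variance(y_adj)  # smaller than var_raw; the mean is unchanged
```

In an actual experiment, $\theta$ would usually be estimated on data pooled across arms, and the covariate must be measured before assignment so that treatment cannot affect it.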

Part 7 — Ramping, rollback, and launch decision

Define the criteria for ramping, rollback, and final launch. Give the ramp sequence, explicit rollback triggers, and a multi-factor launch decision (not just "is lift positive?").

What This Part Should Cover

A ramp sequence: offline replay/backtest with point-in-time features (NDCG, log loss, calibration) → shadow mode → canary ramp (e.g. 1%→5%→25%→50%) → full experiment over complete weekly cycles.
Rollback triggers: latency regression, ETA/cancellation degradation, elevated missingness/timeouts, SRM/logging corruption, severe merchant-exposure skew.
A launch decision framework: primary metric significant AND guardrails intact AND robust across key slices AND gain large enough to justify infra/latency cost AND not dependent on fragile real-time features AND not just novelty.
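
The rollback triggers and the multi-factor launch rule can be expressed as explicit checks so the decision is auditable. This is a sketch only: every threshold, metric name, and function name below is a hypothetical placeholder, and real limits would come from the guardrail owners.

```python
from typing import Dict, List, Tuple

# Hypothetical limits on treatment-minus-control deltas and rates.
ROLLBACK_LIMITS = {
    "p99_latency_delta_ms": 10.0,
    "eta_delta_minutes": 0.5,
    "cancellation_rate_delta": 0.002,
    "feature_timeout_rate": 0.02,
    "exposure_hhi_delta": 0.05,
}
SRM_ALPHA = 0.001  # assumed SRM alert threshold


def rollback_reasons(metrics: Dict[str, float]) -> List[str]:
    """Names of guardrails that breach their limit; an empty list means healthy."""
    reasons = [name for name, limit in ROLLBACK_LIMITS.items() if metrics[name] > limit]
    if metrics["srm_p_value"] < SRM_ALPHA:
        reasons.append("srm_p_value")
    return reasons


def launch_decision(metrics: Dict[str, float], lift: float, lift_ci_low: float,
                    min_practical_lift: float, slices_consistent: bool,
                    lift_stable_over_time: bool) -> Tuple[bool, List[str]]:
    """Launch only if every condition holds; otherwise return the blockers."""
    blockers = rollback_reasons(metrics)
    if lift_ci_low <= 0:
        blockers.append("primary_metric_not_significant")
    if lift < min_practical_lift:
        blockers.append("lift_below_practical_threshold")
    if not slices_consistent:
        blockers.append("not_robust_across_slices")
    if not lift_stable_over_time:
        blockers.append("possible_novelty_effect")
    return (not blockers, blockers)


healthy = {
    "p99_latency_delta_ms": 3.0, "eta_delta_minutes": 0.1,
    "cancellation_rate_delta": 0.0005, "feature_timeout_rate": 0.004,
    "exposure_hhi_delta": 0.01, "srm_p_value": 0.42,
}
ok, blockers = launch_decision(healthy, lift=0.006, lift_ci_low=0.002,
                               min_practical_lift=0.005, slices_consistent=True,
                               lift_stable_over_time=True)
```

The same `rollback_reasons` check can run at each canary step, so a ramp from 1% to 5% proceeds only when it returns an empty list.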

What a Strong Answer Covers

Across all seven parts, a strong answer should read as one coherent plan rather than seven disconnected checklists. Cross-cutting signals the interviewer is looking for:

Internal consistency — the Part 2 randomization choice, the Part 4 logging, the Part 5 interference handling, and the Part 6 power math reinforce each other (e.g. choosing switchback implies a design-effect penalty and marketplace-level logging).
Marketplace/causal sophistication — treats interference, SUTVA, and two-sided effects as first-class, not an afterthought.
Operational safety as a co-equal goal — guardrails, shadow mode, and staged rollout are integral, not bolt-ons.
Quantitative grounding — concrete metric, baseline, MDE, and a defensible launch bar.
Judgment over dogma — defaults with explicit switch conditions, and a launch decision that weighs lift against complexity, latency, and fragility.

Follow-up Questions

Suppose user-level SRM is clean but the geo-time switchback shows a strong day-part interaction (treatment wins at lunch, loses at dinner). How do you decide whether to launch, and to whom?
V2.0's lift comes mostly from a single real-time feature with a 2% serving timeout rate. How do you quantify how much of the measured lift is the model versus infra quality, and what would you require before launch?
Online conversion is up +0.4% but offline NDCG was flat. How do you reconcile this, and which do you trust?
After full launch, the lift decays over three weeks toward zero. How do you distinguish a novelty effect from a genuine regression introduced by ramping, and what experiment would you run to find out?

