Estimate ads ranking revenue impact
Company: Meta
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: medium
Interview Round: Onsite
You are the data scientist for an ads ranking team at a large social platform. The team has built a new ranking algorithm for feed ads. The new model changes the ordering of ads by combining **bid**, **predicted click-through rate (pCTR)**, **predicted conversion rate (pCVR)**, and **ad quality** differently from the current production ranker.
A short ramp suggests that revenue per daily active user (DAU) increased, but the team is worried that the short-term lift may not represent the medium-term impact: users may adapt to the new ad mix, advertisers may change bids or budgets, and auction dynamics may shift.
Your task is to design an approach to estimate the **medium-term revenue impact** of launching the new ads ranking algorithm over a **4- to 8-week horizon**, and to make a launch recommendation.
### Constraints & Assumptions
- Platform scale: a large eligible DAU base (a specific headcount is supplied in Part 5 for the scale-up); revenue is driven by an ad auction in which advertisers set daily/lifetime budgets that are paced over time.
- The intervention is a **feed ranking change**, experienced at the user level, but it interacts with a **shared marketplace** (advertiser budgets are pooled across users).
- A short (e.g. 2-day) ramp already showed a positive revenue-per-DAU signal; the open question is whether that lift persists, decays, or reverses over weeks.
- You have access to standard experimentation infrastructure (A/B testing, geo holdouts), pre-period covariates, and advertiser-side delivery/pacing data.
- "Medium-term" specifically means capturing user adaptation, advertiser bid/budget response, and at least one full advertiser budget cycle.
### Clarifying Questions to Ask
- Is the goal to estimate revenue at **full rollout** (the launch decision) or just the effect within the experiment population? This determines whether spillover is bias or part of the estimand.
- What is the relevant **advertiser budget cycle length**, and roughly what fraction of revenue comes from budget-constrained advertisers?
- What guardrail thresholds (retention, negative feedback, advertiser ROAS) are considered launch-blocking versus monitor-only?
- Do we have a usable **geo/market holdout** and pre-period covariates for variance reduction (e.g. CUPED)?
- Is there an existing **revenue or LTV model** the business uses to value medium-term user/advertiser effects?
- What statistical power / minimum detectable effect (MDE) is achievable at each candidate randomization unit?
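The CUPED question can be made concrete with a minimal sketch. It assumes one pre-period covariate per user (here, pre-period revenue) and uses simulated data. The function name `cuped_adjust`, the simulation parameters, and the effect size are all hypothetical. The adjustment subtracts θ·(X − mean X), where θ is the pooled OLS slope of the outcome on the covariate. Because X is measured before assignment, the treatment-control difference stays unbiased, while the variance that X explains is removed.

```python
import numpy as np


def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Return CUPED-adjusted outcomes y - theta * (x_pre - mean(x_pre)).

    theta is the OLS slope of y on the pre-period covariate, pooled across arms.
    """
    theta = np.cov(x_pre, y)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())


# Hypothetical simulation: in-window revenue is strongly driven by pre-period revenue.
rng = np.random.default_rng(0)
n = 10_000
x_pre = rng.gamma(shape=2.0, scale=1.0, size=n)            # pre-period revenue per user
treated = rng.integers(0, 2, size=n).astype(bool)          # user-level assignment
y = 0.9 * x_pre + 0.05 * treated + rng.normal(0, 0.3, n)   # in-window revenue per user

y_adj = cuped_adjust(y, x_pre)
effect_raw = y[treated].mean() - y[~treated].mean()
effect_cuped = y_adj[treated].mean() - y_adj[~treated].mean()
print(f"raw effect: {effect_raw:.4f}, CUPED effect: {effect_cuped:.4f}")
print(f"variance ratio (adjusted / raw): {y_adj.var() / y.var():.3f}")
```

With a strongly predictive covariate, the adjusted variance shrinks by roughly a factor of 1 − ρ², where ρ is the covariate-outcome correlation. That reduction is what buys back power at coarser randomization units.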
### Part 1 — Causal estimand
Define the precise causal quantity you want to estimate, including the population, the comparison, the time horizon, and the outcome(s) it should encompass.
```hint What "estimand" means here
Be explicit about the four pieces of any estimand: population (eligible users / markets), treatment vs. counterfactual (launching the new ranker vs. keeping the current one, both evaluated at full rollout), horizon (4-8 weeks), and outcome. Ask whether a per-impression metric or a per-user / cumulative metric better matches the business question.
```
```hint What to encompass
The "revenue impact" of a ranking change is not just immediate auction revenue. Consider whether the estimand should also internalize medium-term marketplace effects — advertiser budget reallocation, future ad inventory from retained users, and advertiser value (conversions / ROAS) — because these feed back into revenue.
```
#### What a Strong Answer Covers
- A precise estimand naming the population, the counterfactual (full rollout vs. status quo), the 4-8 week horizon, and the outcome.
- Recognition that the launch decision implies a **full-equilibrium** estimand, not the partial-equilibrium effect on a small treated slice.
- An outcome broader than per-impression revenue, with awareness of why per-impression revenue is gameable.
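Here is one way to write the full-equilibrium estimand down, as a sketch of how the four pieces fit together. The notation is assumed for illustration and does not come from the prompt. $Y_{it}$ is revenue attributed to eligible user $i$ on day $t$. $\mathcal{U}$ is the eligible population, of size $N$. $T$ is the window length (28 days in Part 5). $\mathbf{z}$ is the vector of ranker assignments: $\mathbf{1}$ means everyone gets the new ranker, and $\mathbf{0}$ means everyone keeps the current one.

```latex
\begin{aligned}
\tau_{\mathrm{FE}} &= \frac{1}{N} \sum_{i \in \mathcal{U}} \sum_{t=1}^{T}
  \mathbb{E}\left[ Y_{it}(\mathbf{1}) - Y_{it}(\mathbf{0}) \right] \\
\tau_{\mathrm{PE}} &= \frac{1}{N} \sum_{i \in \mathcal{U}} \sum_{t=1}^{T}
  \mathbb{E}\left[ Y_{it}(1, \mathbf{z}_{-i}) - Y_{it}(0, \mathbf{z}_{-i}) \right]
\end{aligned}
```

$\tau_{\mathrm{FE}}$ is the cumulative revenue lift per eligible user over the window, which is the quantity the launch decision needs. Multiplying it by $N$ gives the company-level window total. $\tau_{\mathrm{PE}}$ is what a user-level test identifies, with $\mathbf{z}_{-i}$ the rest of the assignment vector (mostly control during a small ramp). The two coincide only under no interference, and shared advertiser budgets break that assumption. You can broaden the outcome in the direction the "What to encompass" hint suggests by extending $Y_{it}$ with advertiser-value terms such as conversions.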
### Part 2 — Experiment or quasi-experiment design
Specify the design you would run, including the **randomization unit**, why you chose it, and how long it must run. Explain the tradeoff between user-level and market-level (geo) randomization for a shared-auction product.
```hint Randomization unit tradeoff
Start from the no-interference (SUTVA) assumption and ask whether it holds when treated and control users draw from the same advertiser budgets. Weigh the statistical power of finer units against the contamination they allow.
```
```hint Duration
The horizon must cover the feedback loops you are worried about: weekly seasonality, advertiser budget/pacing cycles, bid re-optimization, and user adaptation. A 2-day ramp captures none of these.
```
#### What a Strong Answer Covers
- A randomization choice **justified by the interference tradeoff**, not a generic "run an A/B test."
- A clear comparison of user-level (power, clean UX read, but biased by spillover) vs. geo/market-level (internalizes the marketplace, but low power).
- A duration that spans at least one full advertiser budget cycle and emphasizes later, post-transient weeks.
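A rough way to see the power side of the tradeoff is a minimum detectable effect (MDE) calculation with a design effect for cluster (geo) randomization. This sketch rests on textbook assumptions: a two-sided z-test, equal-size clusters, and a single intra-cluster correlation (ICC). The revenue standard deviation, sample sizes, geo counts, and ICC below are hypothetical placeholders, not platform numbers. With only a few dozen geos, a t-based or randomization-inference analysis would be more appropriate than the normal approximation.

```python
from statistics import NormalDist


def mde(sigma: float, n_per_arm: int, alpha: float = 0.05, power: float = 0.8,
        cluster_size: int = 1, icc: float = 0.0) -> float:
    """Minimum detectable absolute difference in mean revenue per user (two arms).

    Cluster randomization inflates variance by the design effect
    1 + (cluster_size - 1) * icc; cluster_size=1 recovers user-level randomization.
    """
    z = NormalDist().inv_cdf
    design_effect = 1.0 + (cluster_size - 1) * icc
    return (z(1 - alpha / 2) + z(power)) * (2 * sigma**2 * design_effect / n_per_arm) ** 0.5


# Hypothetical inputs: 1M users per arm, revenue SD of $1 per user over the window.
sigma, n_per_arm = 1.0, 1_000_000
user_level = mde(sigma, n_per_arm)
# Same users grouped into 50 geos per arm of 20,000 users each, with an assumed ICC of 0.01.
geo_level = mde(sigma, n_per_arm, cluster_size=20_000, icc=0.01)
print(f"user-level MDE: ${user_level:.4f}, geo-level MDE: ${geo_level:.4f}, "
      f"ratio: {geo_level / user_level:.1f}x")
```

Even a small ICC makes the geo design far less sensitive. That is why a geo holdout is often better suited to checking the direction and rough size of the user-level estimate than to serving as the sole readout.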
### Part 3 — Primary, secondary, and guardrail metrics
Propose a metric tree: one (or co-) primary metric, secondary/diagnostic metrics that explain the *mechanism*, and guardrail metrics on the user and advertiser side.
```hint Picking the primary metric
Prefer a metric robust to impression-mix shifts. Revenue per impression can rise while total revenue per user falls (fewer / worse impressions). A per-eligible-user or cumulative-over-window revenue metric is harder to game.
```
```hint Diagnostics vs. guardrails
Secondary metrics should let you *explain* a revenue move (eCPM, ad load, fill rate, win price, pCTR/pCVR calibration, budget exhaustion time). Guardrails protect against regressions you would not trade revenue for (engagement, retention, negative feedback, advertiser ROAS/CPA, latency).
```
#### What a Strong Answer Covers
- A primary metric robust to impression-mix gaming (per-eligible-user / cumulative revenue, not per-impression).
- Diagnostic metrics chosen to **explain mechanism** (price vs. volume, calibration, pacing), not just describe.
- Guardrails on **both** the user side (engagement, retention, negative feedback) and the advertiser side (ROAS, CPA, churn), plus system health.
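The "price vs. volume" diagnostic can be made concrete with a minimal sketch that splits the log change in revenue per eligible user into an ad-volume term and a price-per-impression term. Because log(revenue/user) = log(impressions/user) + log(revenue/impression), the two terms add up exactly. The `ArmMetrics` class and the numbers are hypothetical. They are chosen to show revenue per impression rising while revenue per eligible user falls.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmMetrics:
    eligible_users: int   # all eligible users in the arm, including those shown no ads
    ad_impressions: int
    revenue: float

    @property
    def impressions_per_user(self) -> float:
        return self.ad_impressions / self.eligible_users

    @property
    def revenue_per_impression(self) -> float:
        return self.revenue / self.ad_impressions

    @property
    def revenue_per_user(self) -> float:
        return self.revenue / self.eligible_users


def decompose(control: ArmMetrics, treatment: ArmMetrics) -> dict[str, float]:
    """Split the log change in revenue per eligible user into volume and price terms."""
    volume = math.log(treatment.impressions_per_user / control.impressions_per_user)
    price = math.log(treatment.revenue_per_impression / control.revenue_per_impression)
    total = math.log(treatment.revenue_per_user / control.revenue_per_user)
    return {"volume": volume, "price": price, "total": total}


# Hypothetical arms: price per impression rises 5% but ad impressions per user fall 10%.
control = ArmMetrics(eligible_users=1_000_000, ad_impressions=10_000_000, revenue=100_000.0)
treatment = ArmMetrics(eligible_users=1_000_000, ad_impressions=9_000_000, revenue=94_500.0)
parts = decompose(control, treatment)
print({k: round(v, 4) for k, v in parts.items()})
```

Dividing by eligible users rather than by impressions is what makes the primary metric hard to game. A ranker that shows fewer, pricier ads appears here as a negative volume term, not as a win.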
### Part 4 — Auction interference, budgets, seasonality, and heterogeneity
Explain how you would handle each of the four threats to validity below, and how each could bias a naive user-level read:
- **Auction interference / spillover** (shared advertiser budgets across treatment and control).
- **Advertiser budget constraints and pull-forward** (faster spend now ≠ more spend over the window).
- **Seasonality** (day-of-week, holidays, campaign cycles).
- **User-level heterogeneity** (effect varies by market, tenure, engagement, ad load).
```hint Interference
Ask which advertisers actually carry the spillover, and whether a design that internalizes the marketplace (e.g. a geo holdout) could check the user-level number. Think about the *direction* of the bias.
```
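A toy model shows the direction of the spillover bias for budget-constrained advertisers. For illustration only, it assumes that pacing scales an advertiser's spend down proportionally across arms whenever total demand exceeds the budget. Real pacing systems are more complex. The function names, budgets, and demand figures are hypothetical.

```python
def arm_spend(budget: float, demand_treat: float, demand_ctrl: float) -> tuple[float, float]:
    """Spend booked in each arm of a 50/50 user-level test for one advertiser.

    Toy pacing assumption: if combined demand exceeds the budget, spend is scaled
    down proportionally in both arms; otherwise each arm spends its full demand.
    """
    scale = min(1.0, budget / (demand_treat + demand_ctrl))
    return demand_treat * scale, demand_ctrl * scale


def user_level_lift(budget: float, demand_treat: float, demand_ctrl: float) -> float:
    """Relative lift a naive user-level A/B would report for this advertiser."""
    treat, ctrl = arm_spend(budget, demand_treat, demand_ctrl)
    return treat / ctrl - 1


def full_rollout_lift(budget: float, demand_treat: float, demand_ctrl: float) -> float:
    """Relative lift of everyone-treated vs. everyone-control.

    demand_* is per half of the user base, so the whole population generates 2x.
    """
    treat_world = min(budget, 2 * demand_treat)
    ctrl_world = min(budget, 2 * demand_ctrl)
    return treat_world / ctrl_world - 1


# Hypothetical advertiser: the new ranker raises its demand per half from 60 to 75.
constrained = dict(budget=100.0, demand_treat=75.0, demand_ctrl=60.0)
unconstrained = dict(budget=1_000.0, demand_treat=75.0, demand_ctrl=60.0)
for name, adv in [("constrained", constrained), ("unconstrained", unconstrained)]:
    print(name, round(user_level_lift(**adv), 3), round(full_rollout_lift(**adv), 3))
```

For the constrained advertiser, the user-level test reports about +25%, while full rollout yields 0%. Treatment wins a larger share of a fixed budget at control's expense, so the naive read is biased upward. The unconstrained advertiser shows the same lift under both designs. This is why slicing revenue by budget-constrained status is a useful diagnostic.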
```hint Pull-forward and seasonality
Watch the *trajectory* over the window, not a single day, and read what the budget-pacing diagnostics are telling you. Concurrent randomization and pre-period covariates also help.
```
```hint Heterogeneity
Decide which segments matter *before* you look, so a varying effect is a pre-registered finding rather than a fishing expedition.
```
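Once the segment list is fixed in advance, the segment-level p-values should be corrected as a family. Below is a minimal sketch of the Holm-Bonferroni step-down procedure, which is one standard choice. Benjamini-Hochberg is a common alternative if false-discovery-rate control is enough. The segment names and p-values are hypothetical.

```python
def holm_reject(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Holm-Bonferroni step-down test over a pre-registered family of segment p-values.

    Compares the k-th smallest p-value (k = 0, 1, ...) with alpha / (m - k) and stops
    at the first failure; all remaining hypotheses are retained.
    """
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    reject = {name: False for name in p_values}
    for k, (name, p) in enumerate(ordered):
        if p > alpha / (m - k):
            break
        reject[name] = True
    return reject


# Hypothetical pre-registered segments and their p-values for the revenue lift.
segment_p = {"new_users": 0.004, "high_ad_load": 0.02,
             "emerging_markets": 0.03, "low_engagement": 0.20}
print(holm_reject(segment_p))
```

Here two segments that look nominally significant at 0.05 do not survive the correction. Pre-registration plus adjustment is meant to filter out exactly this kind of "finding".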
#### What a Strong Answer Covers
- For each threat: the **mechanism**, the **direction of bias** on a naive user-level read, and a concrete mitigation.
- Spillover treated as the headline threat, with constrained vs. unconstrained advertiser slices and a geo check.
- Pull-forward distinguished from real lift via **cumulative** revenue and pacing/exhaustion diagnostics; a stated variance-reduction technique for seasonality; pre-registered segments for heterogeneity.
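Below is a minimal sketch of the cumulative-over-window read. It takes each arm's mean revenue per eligible user for each day and reports both the week-by-week lift and the cumulative lift over the whole window. The daily series are hypothetical and shaped like pull-forward: a large early lift that fades and turns negative, which is the pattern a single early-day read would miss.

```python
def weekly_and_cumulative_lift(treat_daily: list[float], ctrl_daily: list[float],
                               days_per_week: int = 7) -> tuple[list[float], float]:
    """Weekly relative lifts and the cumulative lift over the whole window.

    Inputs are each arm's mean revenue per eligible user, one value per day.
    """
    if len(treat_daily) != len(ctrl_daily):
        raise ValueError("both arms must cover the same days")
    weekly = []
    for start in range(0, len(treat_daily), days_per_week):
        t = sum(treat_daily[start:start + days_per_week])
        c = sum(ctrl_daily[start:start + days_per_week])
        weekly.append(t / c - 1)
    cumulative = sum(treat_daily) / sum(ctrl_daily) - 1
    return weekly, cumulative


# Hypothetical 28-day series shaped like pull-forward: +5%, +2%, +1%, then -1% by week.
ctrl = [0.100] * 28
treat = [0.105] * 7 + [0.102] * 7 + [0.101] * 7 + [0.099] * 7
weekly, cumulative = weekly_and_cumulative_lift(treat, ctrl)
print("weekly lifts:", [f"{w:+.1%}" for w in weekly])
print(f"cumulative lift over 28 days: {cumulative:+.2%}")
```

Week 1 alone would suggest +5%, but the 28-day cumulative lift is about +1.75% and the last week is negative. A decaying weekly curve does not by itself separate pull-forward from user adaptation. Pacing diagnostics, such as budget exhaustion times moving earlier, are what point to pull-forward.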
### Part 5 — Translating to company-level revenue impact
Given a per-user effect estimate with a confidence interval, show how you would scale it to a company-level revenue figure over the window, and what corrections/caveats you would attach. For the scale-up, take the eligible population as **200M DAU** and a **28-day window**.
```hint Scaling and caveats
Scale the per-user-per-day lift by eligible DAU and the number of days, and propagate the confidence interval through the same multiplication (round only at the end). Then discount or flag the point estimate for known biases — budget pull-forward, advertiser ROI harm, UX-driven inventory loss, and any measured experiment spillover.
```
#### What a Strong Answer Covers
- Correct scale-up arithmetic: per-user-per-day lift × eligible DAU × days.
- A confidence interval propagated through the same multiplier (not just a point estimate).
- Honest caveats that move the headline number: pull-forward, ROI harm, inventory loss, and spillover (geo vs. user-level reconciliation).
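The scale-up arithmetic applies a single multiplier to the point estimate and to both interval endpoints. This sketch uses the prompt's 200M DAU and 28-day window. The per-user-per-day lift (USD 0.002, 95% CI 0.0005 to 0.0035) and the 0.5 haircut are hypothetical placeholders, not results. The sketch also assumes DAU, days, and the haircut are fixed constants. If the haircut is itself estimated (for example, from a geo-to-user-level ratio), its uncertainty has to be propagated separately, for instance by bootstrap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    point: float
    low: float
    high: float


def scale_to_company(lift_per_user_day: Interval, eligible_dau: float = 200e6,
                     days: int = 28, haircut: float = 1.0) -> Interval:
    """Scale a per-user-per-day revenue lift and its CI to a window total.

    haircut is a known multiplicative correction in (0, 1], e.g. a
    pull-forward discount; 1.0 applies no correction.
    """
    factor = eligible_dau * days * haircut
    return Interval(lift_per_user_day.point * factor,
                    lift_per_user_day.low * factor,
                    lift_per_user_day.high * factor)


# Hypothetical estimate: +USD 0.002 per eligible user per day, 95% CI [0.0005, 0.0035].
estimate = Interval(point=0.002, low=0.0005, high=0.0035)
raw = scale_to_company(estimate)
corrected = scale_to_company(estimate, haircut=0.5)  # hypothetical 50% haircut
for label, iv in [("raw", raw), ("corrected", corrected)]:
    print(f"{label}: {iv.point / 1e6:.1f}M USD "
          f"(95% CI {iv.low / 1e6:.1f}M to {iv.high / 1e6:.1f}M)")
```

With these placeholder inputs, the raw figure is about 11.2M USD (CI 2.8M to 19.6M), and the haircut halves every number. Rounding happens only when printing.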
### Part 6 — Launch recommendation under a UX-guardrail conflict
State the decision logic you would use when short-term revenue is positive but some user-experience guardrails worsen. Make the tradeoff explicit rather than defaulting to "ship" or "kill".
```hint Make the tradeoff quantitative
Distinguish a cosmetic guardrail move (e.g. a tiny rise in ad hides) from a load-bearing one (a measurable 28-day retention loss). Frame the decision in terms of long-term user and advertiser lifetime value, and consider ramp / segment-targeting as middle options, not just full launch vs. no launch.
```
#### What a Strong Answer Covers
- An explicit decision rule with three outcomes (launch / ramp-or-target / do-not-launch), not a binary.
- The conflict resolved against **long-term value** (user and advertiser LTV), not this window's dollars.
- A worked contrast between an acceptable guardrail move and a launch-blocking one.
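Writing the three-outcome rule as a small function makes the thresholds explicit and reviewable before the readout. This is a sketch, not a recommended policy. The field names and threshold values are hypothetical; in practice the thresholds would come from the LTV model, which prices how much long-term user and advertiser value a given retention or ROAS change is worth. All inputs are relative changes. The retention check uses the upper CI bound, so only a confidently large loss blocks launch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Readout:
    revenue_lift_ci_low: float     # lower 95% bound on cumulative revenue lift per eligible user
    retention_change: float        # point estimate of relative change in 28-day retention
    retention_ci_high: float       # upper 95% bound on that retention change
    ad_hide_change: float          # relative change in ad hide / negative-feedback rate
    advertiser_roas_change: float  # relative change in advertiser ROAS


def launch_decision(r: Readout, retention_block: float = -0.002,
                    hide_tolerance: float = 0.05, roas_tolerance: float = -0.02) -> str:
    """Return 'launch', 'ramp_or_target', or 'do_not_launch' (hypothetical thresholds)."""
    # No credible medium-term revenue gain: nothing to trade guardrail risk against.
    if r.revenue_lift_ci_low <= 0:
        return "do_not_launch"
    # Load-bearing harm: even the optimistic end of the CI is a retention loss beyond tolerance.
    if r.retention_ci_high < retention_block:
        return "do_not_launch"
    # Ambiguous user harm or material advertiser harm: ramp slowly or target segments.
    if (r.retention_change < 0
            or r.ad_hide_change > hide_tolerance
            or r.advertiser_roas_change < roas_tolerance):
        return "ramp_or_target"
    return "launch"


cosmetic = Readout(0.01, 0.0, 0.001, 0.02, -0.005)      # tiny rise in ad hides only
blocking = Readout(0.01, -0.006, -0.003, 0.03, -0.005)  # measurable 28-day retention loss
ambiguous = Readout(0.01, -0.001, 0.0005, 0.08, -0.01)  # noisy retention dip, more hides
for name, r in [("cosmetic", cosmetic), ("blocking", blocking), ("ambiguous", ambiguous)]:
    print(name, launch_decision(r))
```

The cosmetic case launches and the retention-loss case is blocked. The noisy middle case lands on a ramp or segment-targeted launch, rather than being forced into ship or kill.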
### What a Strong Answer Covers
These dimensions span all parts and should be visible across the answer as a whole:
- **Keeping interference central.** The candidate never loses sight of the fact that a per-user intervention priced in a shared, budget-constrained auction violates SUTVA, and lets that drive the design, the metrics, and the company-level correction.
- **Cumulative-over-window thinking.** Revenue is read as a trajectory across a full budget cycle, never as a single-day or per-impression snapshot.
- **Quantitative honesty.** Estimates carry intervals, caveats are subtractive (they change the number), and the final recommendation is grounded in long-term user and advertiser lifetime value rather than the immediate lift.
### Follow-up Questions
- Suppose the user-level A/B shows +2.4% revenue but a concurrent geo holdout shows roughly 0%. How do you reconcile them, and which do you trust for the launch decision?
- The week-1 lift is +5% but week-4 is +1% and still positive. How do you tell budget pull-forward apart from genuine user adaptation, and how does it change your company-level estimate?
- Advertiser ROAS drops slightly in treatment. Why might that erode the revenue lift over a horizon longer than 8 weeks, and how would you monitor it post-launch?
- How would you design a long-term holdout to keep measuring the launched change after the experiment ends?
Quick Answer: This question evaluates competency in causal inference, experimentation design, metric construction, and marketplace economics for measuring ad ranking revenue within the Analytics & Experimentation domain.