Design and power a frequency-cap experiment
Company: Netflix
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: hard
Interview Round: Onsite
A product team wants to raise the per-user rolling 7-day frequency cap for a large video ad campaign from 3 to 4 impressions. Design an experiment and provide power calculations that account for interference and clustering.
Context and requirements:
- Population: US users eligible for the campaign; expected 4,000,000 eligible users/day over the test.
- Randomization candidates: user_id, household_id, or geo cell; average household size among eligible users is m = 1.3; household-level ICC for the primary metric is 0.02.
- Primary metric: 7-day conversion rate per unique exposed user (any purchase within 7 days of first exposure), baseline p0 = 2.00%.
- Guardrails: daily unique reach, average session watch time, complaint rate per 1,000 impressions.
- Traffic allocation: 50% Treatment (cap=4), 50% Control (cap=3), planned duration 28 days, with 4 equally spaced interim looks (including final).
- CUPED: pre-period 7-day metric available with R^2 = 0.35 to reduce variance.
- Interference risks: auctions shared across campaigns, overlapping advertisers, cross-device households, and pacing controls.
Tasks:
(1) Choose the randomization unit and justify it with a causal diagram: specify where interference could occur and how your choice mitigates it; propose cross-campaign holdouts or ghost-bids if needed.
(2) Define precise metric formulas (numerators/denominators, exposure semantics, attribution window, de-duplication across devices) and the data you would log to compute them unambiguously.
(3) Compute the minimum per-arm sample size (unique users) to detect an absolute lift from 2.00% to 2.10% (Δ = +0.10 pp) with α = 0.05 (two-sided) and 1−β = 0.80 using a two-proportion z-test. Adjust for clustering via VIF = 1 + (m−1)·ICC, then adjust again for CUPED by multiplying variance by (1−R^2). Show the final effective sample size and discuss whether 28 days of traffic suffices.
(4) Specify sequential monitoring using O’Brien–Fleming boundaries for 4 looks: give approximate nominal α at each look and describe the decision rules.
(5) List at least three diagnostic checks (e.g., covariate balance on pre-period exposures, saturation by user quantile, auction pressure) and the exact plots you would produce. Explain how you would interpret each to decide whether to ship the higher cap.
Quick Answer: This question evaluates experimental design and causal inference skills, including power analysis, clustering and interference mitigation, metric engineering, sequential monitoring, and diagnostic interpretation for large-scale ad experiments.