Estimate Super Bowl QR ad sign-ups
Company: Coinbase
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: hard
Interview Round: Technical Screen
CoinFactory ran a 60-second Super Bowl TV spot on 2025-02-09 featuring a QR code that links to a sign-up page; each successful sign-up receives a $15 coupon. Estimate the incremental sign-ups attributable to the ad in the first 48 hours and quantify the uncertainty.
Provide a concrete plan that includes:
1) Identification: propose at least two distinct methods (e.g., high-frequency time-series counterfactual with synthetic control, geo-lift/DiD with control DMAs, calibrated MMM short-horizon attribution). State the identifying assumptions explicitly.
2) Data you would use: minute-level traffic/sign-ups for the prior 8 comparable Sundays, QR UTM-tagged sessions with device/IP de-dup rules, coupon issuance/redemption logs (coupon_id, user_id, issued_at, redeemed_at), TV air-times/GRPs by DMA, press-mention timestamps, bot-filtering heuristics, app store ranking changes, and site latency/error logs.
3) De-duplication and leakage: handle multi-device scans, dark social reshares of the QR URL, bots, and post-game press coverage spillover. Explain how you’ll separate organic baseline from paid lift and how to attribute delayed sign-ups within the 48h window.
4) Back-of-the-envelope (compute): Suppose the logs show 12,000,000 QR scans, of which 40% remain after de-dup; landing→sign-up conversion is 22%; the baseline is 50,000 sign-ups/day (absent the ad); and press coverage added an 8% lift to the baseline for the first 24h. A competitor ran a similar QR ad in 8 DMAs that make up 12% of our reach and cannibalized 25% of our QR traffic there. Estimate incremental sign-ups and provide a 90% CI using a reasonable variance model; show each adjustment step (baseline subtraction, cannibalization, spillover).
5) Validation: cross-check with coupon redemptions (assume 20% redeem within 7 days) and with geo heterogeneity. Describe how you’d reconcile differences across the methods and decide on the final estimate.
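The arithmetic in item 4 and the coupon cross-check in item 5 can be sketched as follows. This is one possible reading of the prompt, not a definitive answer. It assumes the 12,000,000 scans are counted before cannibalization (if the logs already reflect the lost traffic, set cannibalized=0), that press-driven sign-ups are spillover to be excluded from paid lift, and that the de-dup and conversion rates each carry a 10% relative uncertainty. The function name, the parameter names, and the two 10% figures are illustrative assumptions and do not come from the prompt.

```python
from math import sqrt
from statistics import NormalDist


def estimate_signups(
    scans=12_000_000, dedup_keep=0.40, conversion=0.22,
    baseline_per_day=50_000, days=2, press_lift=0.08,
    competitor_reach=0.12, cannibalized=0.25, redeem_rate=0.20,
    cv_dedup=0.10, cv_conversion=0.10,  # ASSUMED relative uncertainties
):
    # Step 1: de-duplicate scans.
    unique_scans = scans * dedup_keep
    # Step 2: cannibalization. 25% of traffic lost in DMAs covering 12% of reach.
    effective_scans = unique_scans * (1 - competitor_reach * cannibalized)
    # Step 3: convert landings to sign-ups through the QR path.
    qr_signups = effective_scans * conversion
    # Step 4: non-ad counterfactual = 48h organic baseline + press lift (first 24h only).
    counterfactual = baseline_per_day * days + baseline_per_day * press_lift
    # Conservative reading: assume every baseline/press sign-up arrived via the QR path.
    conservative = qr_signups - counterfactual

    # Variance model (assumed): delta method on the product of uncertain rates,
    # plus binomial sampling noise on conversions.
    rate_se = qr_signups * sqrt(cv_dedup**2 + cv_conversion**2)
    binom_se = sqrt(effective_scans * conversion * (1 - conversion))
    se = sqrt(rate_se**2 + binom_se**2)
    z = NormalDist().inv_cdf(0.95)  # two-sided 90% interval

    return {
        "qr_signups": qr_signups,
        "counterfactual": counterfactual,
        "conservative_incremental": conservative,
        "ci90_qr_signups": (qr_signups - z * se, qr_signups + z * se),
        # Cross-check for item 5: redemptions expected within 7 days.
        "expected_redemptions": qr_signups * redeem_rate,
    }


result = estimate_signups()
for name, value in result.items():
    print(name, value)
```

With the prompt's inputs, the QR path gives 12,000,000 × 0.40 × 0.97 × 0.22 = 1,024,320 sign-ups, the non-ad counterfactual is 100,000 + 4,000 = 104,000, and the conservative incremental figure is 920,320. The interval width is driven almost entirely by the assumed rate uncertainties; the binomial term is negligible at this volume, so the choice of those two inputs deserves the most scrutiny.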
Quick Answer: This question evaluates a data scientist's proficiency in causal inference and attribution, high-frequency time-series and geo-experimental design, event-level instrumentation and de-duplication, and statistical uncertainty quantification.
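As a companion to the geo-lift/DiD method named in item 1, here is a minimal sketch of a two-group, two-period difference-in-differences estimate. It assumes some DMAs had materially lower ad exposure and can serve as controls, and that treated and control DMAs would have followed parallel trends absent the ad. The function name and all input numbers below are hypothetical placeholders, not data from the prompt.

```python
from statistics import mean


def did_lift(treated_pre, treated_post, control_pre, control_post):
    """Additive DiD: change in treated DMAs minus change in control DMAs.

    Each argument is a list of per-DMA sign-up counts for the same window
    length (for example, 48h before vs. 48h after the air time).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical per-DMA sign-up counts, for illustration only.
lift_per_dma = did_lift(
    treated_pre=[100, 120], treated_post=[300, 340],
    control_pre=[90, 110], control_post=[100, 120],
)
print(lift_per_dma)
```

Because a national broadcast reaches every DMA to some degree, the control group here is only as good as the exposure gap between groups; comparing this estimate against the time-series counterfactual is one way to check whether the parallel-trends assumption is plausible.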