This question evaluates competence in experimental design, causal inference, online A/B testing, and launch-decision frameworks for ad recommendation systems. It covers choice of randomization unit, metric definition and guardrails, power and duration estimation, validity checks, and monitoring and visualization, within the Analytics & Experimentation domain.

You are on the Ads team and have just trained a new ad recommendation model meant to replace the current production model. Design a rigorous online-experimentation plan for deciding whether to launch it.
a) Experiment design: Choose the unit of randomization (user, session, impression, advertiser, or auction) and justify it given auction interference, budget pacing, and repeat exposures. How will you avoid crossover and contamination? Would you run an A/A test first? Describe the ramp plan and holdout strategy.
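For part (a), a minimal sketch of deterministic, salted hash bucketing at the user level; the salt string and split fraction are illustrative assumptions, not a prescribed setup:

```python
import hashlib

def assign_bucket(user_id: str, salt: str = "ads_ranker_exp_v2",
                  treatment_frac: float = 0.5) -> str:
    """Deterministically assign a user to control or treatment via salted hashing.

    User-level assignment gives each user a stable arm across sessions and
    impressions, avoiding crossover; a per-experiment salt decorrelates this
    split from other concurrent experiments.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket_value = int(digest[:8], 16) / 2**32  # uniform value in [0, 1)
    return "treatment" if bucket_value < treatment_frac else "control"
```

Note that user-level hashing alone does not remove auction-level interference (treated and control users still compete in the same auctions and share advertiser budgets), which is part of what a strong answer should address.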
b) Metrics: Define a single primary decision metric and at least three guardrails spanning users, advertisers, and platform health (e.g., CTR vs. revenue per mille, advertiser ROI/CPA, latency p95, ad complaints). Explain mean-of-ratios vs. ratio-of-means for CTR and state which you’ll use.
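For part (b), a small worked contrast of the two CTR estimators on made-up per-user data; note that with user-level randomization, ratio-of-means is a ratio metric whose standard error requires the delta method or a user-level bootstrap:

```python
import numpy as np

# Hypothetical per-user (clicks, impressions); the heavy user dominates ratio-of-means.
clicks = np.array([1, 0, 2, 30])
impressions = np.array([10, 5, 20, 1000])

mean_of_ratios = np.mean(clicks / impressions)     # each user weighted equally: 0.0575
ratio_of_means = clicks.sum() / impressions.sum()  # each impression weighted equally: ~0.0319

print(f"mean-of-ratios CTR: {mean_of_ratios:.4f}")
print(f"ratio-of-means  CTR: {ratio_of_means:.4f}")
```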
c) Power and duration: Assuming a baseline ads RPM of $0.50 and expecting a +1.5% relative lift, outline how you’d estimate required sample size and test duration at α=0.05, power=0.80. State key variance inputs you need and how you would obtain them (historical data, pre-period CUPED, variance reduction via stratification/paired switching).
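For part (c), the usual starting point is the two-sample normal-approximation formula n = 2(z₁₋α/₂ + z₁₋β)² σ²/δ² per arm. A sketch under stated assumptions follows; the revenue standard deviation is a placeholder that would come from historical data, and CUPED would shrink the variance by a factor of 1 − ρ²:

```python
import numpy as np
from scipy import stats

def n_per_arm(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sample z-test detecting absolute lift `delta`."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return int(np.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2))

# Baseline RPM $0.50 with a +1.5% relative lift => absolute delta of $0.0075.
# sigma here is a made-up per-unit revenue std dev; obtain the real value from
# historical data, then rescale by sqrt(1 - rho^2) if applying CUPED.
print(n_per_arm(sigma=0.40, delta=0.0075))  # ~44,652 units per arm under these assumptions
```

Duration then follows by dividing the required n by eligible daily traffic per arm and rounding up to whole weeks so the window covers full day-of-week cycles.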
d) Validity checks: List concrete pre- and in-experiment checks (sample ratio mismatch tests, covariate balance, novelty/fatigue effects, weekday/seasonality, advertiser mix shifts, outlier handling, sequential monitoring corrections, multiple comparisons control across many segments/placements).
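For part (d), the SRM check is typically a chi-square goodness-of-fit test of observed arm counts against the configured split, with a strict threshold (e.g., p < 10⁻³) because a confirmed mismatch invalidates the experiment. A sketch, with illustrative counts:

```python
from scipy.stats import chisquare

def srm_pvalue(n_control: int, n_treatment: int, frac_control: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for sample ratio mismatch."""
    total = n_control + n_treatment
    expected = [total * frac_control, total * (1 - frac_control)]
    _, p = chisquare(f_obs=[n_control, n_treatment], f_exp=expected)
    return p

p = srm_pvalue(1_000_000, 1_010_000)
if p < 1e-3:
    print(f"SRM detected (p={p:.2e}): halt and debug assignment/logging before reading metrics")
```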
e) You’re shown a time-series plot: y-axis = daily ad CTR; two lines (control vs treatment) over 28 days. Critique and improve this plot for decision-making. Be specific about: (i) plotting relative lift and the difference series with confidence/credible intervals, (ii) smoothing vs raw daily volatility, (iii) marking ramp changes and traffic splits, (iv) handling missing days and day-of-week effects, (v) showing heterogeneity by placement and user cohort.
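For part (e), a minimal matplotlib sketch of the improved view: a daily relative-lift series with a confidence band, a zero reference line, and an annotated ramp change. All numbers below are simulated placeholders, not real experiment data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical daily CTRs and per-day standard errors (would come from logs).
days = np.arange(1, 29)
ctr_c = 0.020 + 0.001 * np.sin(2 * np.pi * days / 7)  # control, with weekly cycle
ctr_t = ctr_c * 1.015                                  # treatment, +1.5% relative
se_diff = np.full_like(ctr_c, 0.0004)                  # assumed SE of the daily difference

rel_lift = (ctr_t - ctr_c) / ctr_c * 100               # relative lift in %
ci_halfwidth = 1.96 * se_diff / ctr_c * 100

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(days, rel_lift, label="relative lift (%)")
ax.fill_between(days, rel_lift - ci_halfwidth, rel_lift + ci_halfwidth, alpha=0.3)
ax.axhline(0, linestyle="--", linewidth=1)
ax.axvline(7, linestyle=":", label="ramp 5% -> 50%")   # mark a (hypothetical) ramp change
ax.set_xlabel("day")
ax.set_ylabel("CTR lift vs control (%)")
ax.legend()
plt.tight_layout()
plt.show()
```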
f) Decision rule: Propose an explicit stopping/launch rule (e.g., group-sequential boundaries like Pocock/OBF or Bayesian decision threshold). Include how you’d detect and respond to negative movements in guardrails and how you’d validate long-term effects post-launch (e.g., holdback, switchback, post-launch CUPED DiD).
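For part (f), one concrete shape for the rule is a Bayesian threshold on the posterior probability of improvement, evaluated at pre-declared looks (a group-sequential alternative would use OBF/Pocock boundaries instead). The thresholds and counts below are illustrative:

```python
import numpy as np

def prob_treatment_better(clicks_c: int, imps_c: int,
                          clicks_t: int, imps_t: int,
                          n_draws: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(CTR_t > CTR_c) under independent Beta(1,1) priors."""
    rng = np.random.default_rng(seed)
    post_c = rng.beta(1 + clicks_c, 1 + imps_c - clicks_c, n_draws)
    post_t = rng.beta(1 + clicks_t, 1 + imps_t - clicks_t, n_draws)
    return float(np.mean(post_t > post_c))

# Illustrative rule: launch if P(better) > 0.95 and every guardrail's posterior
# probability of breaching its preset regression limit stays below 0.05;
# abort early if P(better) < 0.05 or any guardrail clearly breaches.
p = prob_treatment_better(19_800, 1_000_000, 20_350, 1_000_000)
print(f"P(treatment CTR > control CTR) = {p:.3f}")
```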