Experiment Test Plan: Ad Scheduling Policy B vs A
Context
- Objective: Evaluate a new ad scheduling policy (B) against the status quo (A).
- Readout window: 2025-08-26 to 2025-09-01 (inclusive), i.e., the last 7 days ending today (2025-09-01).
- Users can interact across web, iOS, and Android, with multiple impressions per day. There is known seasonality by day of week and time of day. Some creatives are long-form video. Offline conversions (site visits) arrive with a 24–48 hour delay.
- Primary KPI: watch_time_per_impression (seconds).
- Guardrails: CTR, skip_rate, daily_active_users (DAU), complaint_rate.
Tasks
Provide a precise test plan and analysis procedure addressing the following, with justifications:
- Randomization unit and exposure control
  - Choose user-level, device-level, or impression-level randomization.
  - Describe exposure caps and how to prevent cross-contamination across platforms and time slots.
  - Specify the hashing/bucketing key.
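
As one possible illustration of a hashing/bucketing key, the sketch below assumes user-level randomization keyed on a stable, cross-platform user ID combined with an experiment-specific salt. The salt string, the function name `assign_variant`, and the 50/50 split are hypothetical choices for the example, not requirements of this plan.

```python
import hashlib

# Hypothetical experiment salt; a distinct salt per experiment keeps
# assignments independent across concurrent tests.
EXPERIMENT_SALT = "ad_sched_policy_b_vs_a"


def assign_variant(user_id: str,
                   salt: str = EXPERIMENT_SALT,
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'A' or 'B' from a salted hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a uniform value in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "B" if bucket < treatment_share else "A"
```

Because the key is the user ID rather than a device or impression ID, the same user would receive the same policy on web, iOS, and Android and in every time slot, which is one way to limit cross-contamination.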
- Stratification and variance reduction
  - Specify strata (e.g., platform × day-of-week × time-slot).
  - State whether CUPED will be used, with exact pre-period dates.
  - Define the CUPED covariate and show the adjusted estimator formula.
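
For reference, the standard CUPED adjustment can be written as follows. This is a sketch under the assumption that the covariate $X_i$ is the same metric measured for unit $i$ in a pre-period (the pre-period dates are left for the plan to specify), and that $\hat\theta$ is estimated on data pooled across both arms.

```latex
\begin{aligned}
\hat\theta &= \frac{\widehat{\operatorname{Cov}}(X, Y)}{\widehat{\operatorname{Var}}(X)}, \\
Y_i^{\text{cuped}} &= Y_i - \hat\theta \,\bigl(X_i - \bar{X}\bigr), \\
\hat\Delta_{\text{cuped}} &= \bigl(\bar{Y}_B - \bar{Y}_A\bigr) - \hat\theta \,\bigl(\bar{X}_B - \bar{X}_A\bigr), \\
\operatorname{Var}\bigl(\hat\Delta_{\text{cuped}}\bigr) &\approx \bigl(1 - \rho_{XY}^2\bigr)\,\operatorname{Var}\bigl(\bar{Y}_B - \bar{Y}_A\bigr).
\end{aligned}
```

Here $\rho_{XY}$ is the correlation between the covariate and the outcome, so the variance reduction depends on how well pre-period behavior predicts in-experiment behavior.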
- Metric definitions
  - Formalize primary KPI and guardrails (numerators/denominators).
  - State whether analysis is intent-to-treat.
  - Explicitly handle zeros and outliers (e.g., winsorization rules).
- Tail choice and statistical test
  - Justify one-tailed vs two-tailed for the primary KPI.
  - Choose an appropriate test (e.g., Welch’s t, stratified difference-in-means, or permutation) and provide the test statistic.
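
If Welch's t-test is the chosen option, its statistic and approximate degrees of freedom take the form below. The notation is an assumption for illustration: $\bar{Y}_g$, $s_g^2$, and $n_g$ denote the sample mean, sample variance, and number of independent units in arm $g \in \{A, B\}$, where units should match the randomization unit.

```latex
t = \frac{\bar{Y}_B - \bar{Y}_A}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^{2}}
{\dfrac{\left(s_A^2 / n_A\right)^{2}}{n_A - 1} + \dfrac{\left(s_B^2 / n_B\right)^{2}}{n_B - 1}}
```

The second expression is the Welch-Satterthwaite approximation used to pick the reference t distribution.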
- Sample size and stopping
  - Compute or outline the per-variant sample size needed to detect a +5% lift in mean watch_time_per_impression (baseline mean 42 s, SD 55 s) at alpha = 0.05 (two-sided) and power = 0.8.
  - Describe any sequential monitoring rule (e.g., always-valid methods) if interim looks are planned.
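
The sample-size arithmetic can be sketched with the usual normal-approximation formula for a two-sample comparison of means, n = 2 (z_{1-alpha/2} + z_{power})^2 * SD^2 / delta^2. The function name `per_variant_n` is hypothetical, and the sketch assumes independent observations with equal variance in both arms; if randomization is at the user level with several impressions per user, the result would need to be inflated by a design effect.

```python
import math
from statistics import NormalDist


def per_variant_n(baseline_mean: float, sd: float, rel_lift: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-variant n for a two-sided, two-sample z-test on means."""
    delta = baseline_mean * rel_lift               # absolute lift to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power=0.8
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sd ** 2 / delta ** 2)


# Inputs from the task: mean 42 s, SD 55 s, +5% lift (delta = 2.1 s).
n = per_variant_n(baseline_mean=42.0, sd=55.0, rel_lift=0.05)
print(n)  # 10768
```

Under these assumptions the requirement is roughly 10.8k independent observations per variant, before any adjustment for clustering or for variance reduction from CUPED.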
- Readout
  - Define the exact 95% CI to report and how to pool across strata.
  - Explain multiple-testing adjustments for guardrails (e.g., Holm).
  - Include at least two diagnostic checks for randomization balance and two for seasonality/novelty effects.
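
If Holm is the chosen guardrail adjustment, the step-down procedure can be sketched in a few lines. The function name `holm_adjust` is hypothetical, and the p-values in the example are illustrative placeholders, not results from this experiment.

```python
from typing import Dict


def holm_adjust(p_values: Dict[str, float]) -> Dict[str, float]:
    """Return Holm step-down adjusted p-values keyed by metric name."""
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    adjusted: Dict[str, float] = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1.
        candidate = min(1.0, (m - rank) * p)
        # Enforce monotonicity so adjusted p-values never decrease.
        running_max = max(running_max, candidate)
        adjusted[name] = running_max
    return adjusted


# Placeholder p-values for the four guardrails (illustration only).
example = {"CTR": 0.01, "skip_rate": 0.04, "DAU": 0.03, "complaint_rate": 0.20}
adjusted = holm_adjust(example)
```

A guardrail would then be flagged when its adjusted p-value falls below the chosen family-wise alpha.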
- Sensitivity to delayed offline conversions
  - Describe how you would re-run the analysis if offline conversions are incomplete for the last 48 hours, and how that affects the readout window.