You’re running a virtual launch (soft roll-out) of a new fitness tracker product to US+CA users from 2025-08-10 to 2025-08-24 with a 50/50 user-level split (Control vs Variant B). A coordinated marketing push (email + paid + influencers) overlaps the test window, causing contamination and uneven exposure. Three data-quality issues have emerged: (a) the observed sample ratio is 52:48 rather than the intended 50:50; (b) purchase events on iOS were dropped for the first 48 hours (2025-08-10 to 2025-08-11); (c) a bug on 2025-08-18 caused an unusual spike in refunds. Pre-period baselines: DAU in eligible geos ≈ 200k; 14-day purchase conversion = 6%; ARPU = $2.20; refund rate = 3% of revenue. Marketing platform logs include user-level ad impressions and email sends.
Design a rigorous analysis plan and decision framework that addresses the messy data and marketing confounds:
- Randomization & exposure: What unit (user, device, geo, or hybrid) and exposure rule would you choose to minimize contamination and noncompliance? How would you handle users who see ads but never get randomized, or who cross over variants across platforms?
- Metrics: Define a primary success metric and at least 3 guardrails (e.g., refund rate, complaint rate, latency, churn). Specify how each is computed, including windows (e.g., 14-day from first exposure) and exclusion rules.
- Validity checks: Describe specific diagnostics for SRM, missing instrumentation, novelty effects, and day-of-week seasonality. For each, state the statistical test or threshold you’ll use and what actions you’d take if it fails.
- Bias mitigation: Propose a concrete approach to adjust for the concurrent marketing push (e.g., geo diff-in-diff with ad intensity as a covariate, CUPED with pre-period spend or engagement, inverse propensity weighting using ad impression propensity). Justify trade-offs among these methods.
- Power & duration: With baseline 6% conversion, 50/50 split, α=0.05 two-sided, 80% power, and 14-day conversion window, compute the minimum detectable relative lift if you can expose ≈ 2.8M eligible users over the test (assume independence and a binomial variance). Is the test adequately powered? If not, propose changes.
- Decision under messiness: Suppose after your adjustments the estimated lift in 14-day conversion is +3.5% (95% CI: −0.5%, +7.5%), ARPU is +1.2%, and refund rate increases by +1.1pp. Would you recommend launch, guardrail-triggered rollback, or extended test? State the exact thresholds that drive your decision and how you’d communicate the trade-offs to marketing and product.
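
For the Power & duration item, the arithmetic can be sketched as below. This is one possible setup, not a prescribed solution: it uses the normal approximation to a two-proportion test and, as a simplifying assumption, applies the baseline variance p(1 − p) to both arms. The function name `mde_relative` and its `split` parameter are hypothetical, introduced only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def mde_relative(baseline, n_total, alpha=0.05, power=0.80, split=0.5):
    """Approximate minimum detectable relative lift for a two-arm test.

    Assumes independent users, binomial variance, and (as a simplification)
    the baseline rate in both arms. `split` is the share of users in Control.
    """
    n_control = n_total * split
    n_variant = n_total * (1.0 - split)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)               # quantile for target power
    # Standard error of the difference in conversion rates between arms
    se = sqrt(baseline * (1.0 - baseline) * (1.0 / n_control + 1.0 / n_variant))
    absolute_mde = (z_alpha + z_beta) * se
    return absolute_mde / baseline


if __name__ == "__main__":
    planned = mde_relative(0.06, 2_800_000)               # intended 50/50 split
    observed = mde_relative(0.06, 2_800_000, split=0.52)  # observed 52:48 split
    print(f"Relative MDE at 50/50: {planned:.2%}")
    print(f"Relative MDE at 52:48: {observed:.2%}")
```

Under these assumptions the relative MDE comes out at roughly 1.3% of the 6% baseline, and the 52:48 imbalance barely changes it. The figure is optimistic for this scenario, since the dropped iOS purchase events and any marketing-driven variance inflation would reduce the effective sample size.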