This question evaluates understanding of experimental design, hypothesis framing, statistical power and minimal detectable effect estimation, experiment diagnostics (such as sample-ratio mismatch and data loss), and causal inference methods for clustered rollouts, situated in the analytics & experimentation domain for data science roles.
A social-media platform plans to evaluate a new feed-ranking algorithm intended to increase daily active minutes (DAM) per user.
Assume you have historical data to estimate the baseline mean and standard deviation of DAM at the user-day level, and traffic is large enough to run a 50/50 split for at least one full weekly cycle.
(a) State the A/B test hypotheses (null and alternative). Choose an Overall Evaluation Criterion (primary metric) and appropriate guardrail metrics.
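As one illustrative framing (assuming mean DAM per user is the quantity being compared, which the question leaves to the candidate), the two-sided hypotheses could be written as:

```latex
% Illustrative two-sided hypotheses, assuming mean DAM per user as the primary comparison
H_0:\ \mu_{\text{treatment}} = \mu_{\text{control}}
\qquad \text{vs.} \qquad
H_1:\ \mu_{\text{treatment}} \neq \mu_{\text{control}}
```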
(b) Specify a minimal detectable effect (MDE) and determine the required per-group sample size for 95% power with a two-tailed test at α = 0.05. Show the formula and work through a small numeric example using reasonable assumptions from the historical data.
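A minimal sketch of the sample-size arithmetic, using the standard two-sample normal approximation; the baseline standard deviation and MDE below are illustrative assumptions, not values given in the question:

```python
# Sketch: per-group sample size for a two-sample comparison of means (normal approximation).
# sigma and mde are hypothetical placeholders; real values come from historical DAM data.
from scipy.stats import norm

alpha = 0.05      # two-tailed significance level
power = 0.95      # desired power (1 - beta)
sigma = 40.0      # assumed baseline SD of DAM at the user-day level (hypothetical)
mde = 1.0         # assumed minimal detectable effect, in minutes (hypothetical)

z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
z_beta = norm.ppf(power)            # ~1.645

# n per group = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2
n_per_group = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2
print(f"Required users per group: {n_per_group:,.0f}")  # roughly 41,600 under these assumptions
```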
(c) After launch, the dashboard shows a time series of average DAM by group. What checks would you perform to confirm experiment health (e.g., parallel pre-period trends, sample-ratio mismatch, data loss)? How would you diagnose and interpret a sudden mid-test dip?
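One common way to check for sample-ratio mismatch is a chi-square goodness-of-fit test of observed assignment counts against the intended 50/50 split; a sketch with hypothetical counts:

```python
# Sketch: sample-ratio mismatch (SRM) check against an intended 50/50 split.
# The observed counts are made up for illustration; real counts come from assignment logs.
from scipy.stats import chisquare

observed = [501_200, 498_100]            # users assigned to treatment / control (hypothetical)
total = sum(observed)
expected = [total * 0.5, total * 0.5]    # expected counts under the 50/50 design

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.001:                      # a commonly used conservative SRM threshold
    print("Possible sample-ratio mismatch: audit assignment and logging before trusting results.")
```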
(d) If rollout is geography-based (clusters) rather than randomized at the user level, explain how you would establish causal inference. Describe an analytic approach (e.g., difference-in-differences or synthetic control), key assumptions, and how you would validate them.
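For a geography-based rollout, a minimal difference-in-differences sketch (assuming a panel of geography-day observations with hypothetical column names dam, treated, post, and geo) could look like the following; the key identifying assumption is parallel pre-period trends between treated and control geographies:

```python
# Sketch: difference-in-differences with standard errors clustered by geography.
# The file name, column names, and panel structure are assumptions for illustration only.
import pandas as pd
import statsmodels.formula.api as smf

# df: one row per geography-day with columns
#   dam     - average daily active minutes in that geo on that day
#   treated - 1 if the geo received the new ranking algorithm, else 0
#   post    - 1 for days on/after the rollout date, else 0
#   geo     - geography identifier used for clustering
df = pd.read_csv("geo_daily_dam.csv")  # hypothetical input

model = smf.ols("dam ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["geo"]}
)
# The treated:post coefficient is the DiD estimate of the rollout effect.
# It is credible only if pre-period trends are parallel; with few geographies,
# cluster-robust standard errors can be unreliable and may need small-sample corrections.
print(model.summary().tables[1])
```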