This question evaluates a data scientist's competency in experimental design, statistical inference, and causal-effect estimation for clustered A/B tests. It covers sample-size calculation with a design effect, choice of cluster-robust hypothesis tests, sequential alpha-spending, multiple-testing control for guardrail metrics, and quantification of contamination bias. It is commonly asked in the Statistics & Math domain to assess whether a candidate can formalize decision rules that control Type I/II error and interpret confidence intervals under clustering and interim looks, testing both conceptual understanding of statistical principles and practical command of experiment governance.

You will formalize the statistical decision rules for the Instagram button experiment described above. Given: baseline exploration rate p0 = 0.15 per user-week, desired MDE = +1.5 percentage points (absolute), two-sided alpha = 0.05, power = 0.80. Randomization is at the cluster level with average cluster size m = 500 users and intra-cluster correlation ICC = 0.02.

(a) Derive the required number of users and clusters per arm using a two-proportion test adjusted for clustering (design effect DE = 1 + (m − 1)·ICC). Show the formulas and final numbers.

(b) Specify the exact hypothesis test you would use for the primary metric (e.g., a z-test on cluster-level means vs. a user-level test with cluster-robust standard errors), and explain when a nonparametric alternative would be preferable.

(c) Define the confidence interval you will report and how you will interpret it jointly with practical significance.

(d) You will look at the primary metric each week for 4 weeks. Choose and justify a sequential testing plan (e.g., O’Brien–Fleming alpha-spending), and show the adjusted per-look alphas.

(e) You track 3 guardrails: p95 latency, crash rate, and add-to-cart rate. Describe a multiple-testing control that preserves power on the primary metric (e.g., hierarchical/gatekeeping testing or Holm–Bonferroni), and write the decision logic combining the primary and guardrails.

(f) If contamination causes 10% of control users to see the button, determine the direction of the bias in the ITT estimate and outline a correction (e.g., a CACE estimate using assignment as an instrument).
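For part (a), a minimal Python sketch of the clustered sample-size calculation: the standard two-proportion formula gives the per-arm n under simple random sampling, which is then inflated by the design effect. The function name is illustrative; only `scipy` is assumed.

```python
import math
from scipy.stats import norm

def clustered_sample_size(p0, mde, alpha, power, m, icc):
    """Per-arm users and clusters for a two-proportion test under cluster randomization."""
    p1 = p0 + mde
    z_a = norm.ppf(1 - alpha / 2)   # 1.96 for two-sided alpha = 0.05
    z_b = norm.ppf(power)           # 0.84 for 80% power
    # Unadjusted per-arm n for comparing two proportions
    n_srs = (z_a + z_b) ** 2 * (p0 * (1 - p0) + p1 * (1 - p1)) / mde ** 2
    de = 1 + (m - 1) * icc          # design effect for equal cluster sizes
    n_users = math.ceil(n_srs) * de
    n_clusters = math.ceil(n_users / m)
    return math.ceil(n_srs), de, n_clusters, n_clusters * m

n_srs, de, k, n = clustered_sample_size(0.15, 0.015, 0.05, 0.80, 500, 0.02)
# ~9,254 users per arm unadjusted; DE = 10.98; ~204 clusters (~102,000 users) per arm
print(n_srs, de, k, n)
```

The striking point for discussion: an ICC of only 0.02 with m = 500 inflates the required sample by roughly 11x, which is why cluster size dominates the design-effect term.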
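For part (d), a sketch of the Lan–DeMets O’Brien–Fleming-type spending function, which gives the cumulative alpha spent at each of the 4 equally spaced looks. Only the spending schedule is computed here; boundary z-values for real analyses would come from a group-sequential package.

```python
from scipy.stats import norm

def obf_spending(alpha, looks):
    """Lan–DeMets O'Brien–Fleming-type spending: a(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))).
    Returns cumulative alpha spent at each look and the per-look increments."""
    z = norm.ppf(1 - alpha / 2)
    fractions = [i / looks for i in range(1, looks + 1)]
    cum = [2 * (1 - norm.cdf(z / t ** 0.5)) for t in fractions]
    inc = [cum[0]] + [cum[i] - cum[i - 1] for i in range(1, looks)]
    return cum, inc

cum, inc = obf_spending(0.05, 4)
for t, c, i in zip((0.25, 0.50, 0.75, 1.00), cum, inc):
    print(f"t={t:.2f}  cumulative={c:.5f}  incremental={i:.5f}")
# Early looks spend almost nothing (~0.0001 at t=0.25), preserving ~0.026 for the final look.
```

This conservatism at early looks is the usual justification for O’Brien–Fleming in weekly-peeking setups: the final analysis retains close to the nominal alpha.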
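For part (e), one way to encode the gatekeeping logic: test the primary at the current look's (sequentially adjusted) alpha, and apply Holm–Bonferroni across the 3 guardrail degradation tests so that any significant degradation vetoes the launch. The function names, the +1.5 pp practical-significance floor, and the ship/no-ship rule are illustrative assumptions, not a fixed standard.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: returns rejection flags in the original order."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (len(p_values) - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

def launch_decision(primary_p, primary_lift, guardrail_ps, alpha_look):
    """Ship only if the primary is significant at this look's alpha, the lift clears
    the practical-significance floor, and no guardrail degrades (Holm at 0.05)."""
    primary_wins = primary_p <= alpha_look and primary_lift >= 0.015
    guardrail_hits = holm_reject(guardrail_ps, alpha=0.05)  # tests of degradation
    return primary_wins and not any(guardrail_hits)

print(launch_decision(0.010, 0.018, [0.30, 0.50, 0.80], alpha_look=0.0238))   # True
print(launch_decision(0.010, 0.018, [0.004, 0.50, 0.80], alpha_look=0.0238))  # False
```

Because the guardrails gate the decision rather than compete with the primary for alpha, the primary metric keeps its full sequential budget, which is the power-preservation point the question asks for.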
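For part (f), the dilution arithmetic: if 10% of control users are exposed while all treatment users are, the ITT contrast is attenuated toward zero, and the complier average causal effect (CACE) rescales it by the difference in exposure rates (a Wald/IV estimate with assignment as the instrument). A minimal numeric sketch with an assumed true effect:

```python
def cace_from_itt(itt_effect, exposed_treat=1.0, exposed_control=0.10):
    """Wald/IV estimate: CACE = ITT / (P(exposed | T) - P(exposed | C))."""
    return itt_effect / (exposed_treat - exposed_control)

# With 10% contamination, the observed ITT is only ~90% of the exposure effect,
# so ITT is biased toward zero (conservative for a superiority claim).
true_effect = 0.015                          # assumed +1.5 pp effect of seeing the button
observed_itt = true_effect * (1.0 - 0.10)    # 0.0135: attenuated by contamination
print(cace_from_itt(observed_itt))           # recovers ~0.015
```

In practice the exposure rates in each arm would be estimated from logging, and the IV standard error would be wider than the ITT's, which is the usual trade-off to mention.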