Test conversion difference and adjust for clustering
Company: Airbnb
Role: Data Scientist
Category: Statistics & Math
Difficulty: Medium
Interview Round: Technical Screen
Using aggregated results for the 7‑day window 2025‑08‑26..2025‑09‑01, evaluate statistical significance and power for conversion uplift, accounting for day‑level clustering:
Given totals: Control (C): visits n_C=10,240, bookings x_C=308; Treatment (T): visits n_T=10,180, bookings x_T=351.
1) Point estimates: compute p_C, p_T, absolute lift (p_T − p_C, in percentage points) and relative lift.
2) Significance: perform a two‑sided test for difference in proportions (unpooled standard error). Report z, p‑value, and a 95% CI for (p_T − p_C). State any continuity correction you apply.
3) Clustering: adjust for day‑level clustering with ICC=0.01 and 7 days per variant. Use design effect DE = 1 + (\bar{m} − 1)·ICC where \bar{m} = n_variant / 7. Recompute effective sample sizes n_eff = n / DE and provide an adjusted p‑value/CI. Explain assumptions and limitations of this correction.
4) Power and sample size: What total visits per variant are required to detect a 0.30 percentage‑point absolute lift from a 3.00% baseline at 80% power and alpha=0.05 using an unpooled z‑test? Show the formula and final n per variant. Then recompute with the design effect from ICC=0.01 to give a clustered n per variant and the implied experiment duration if each variant receives 2,000,000 visits/day.
5) Robustness: briefly describe how you would check day‑to‑day heterogeneity (e.g., Q‑test or interaction with weekday) and how that influences the decision to launch.
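Parts 1 to 3 can be worked through with a short script. The following is a minimal sketch, not a reference solution: it uses only the Python standard library, applies no continuity correction, and treats the design-effect adjustment as a simple rescaling of each arm's sample size while holding the observed proportions fixed. The helper names `diff_test` and `design_effect` are illustrative and do not come from the question.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def diff_test(x_c, n_c, x_t, n_t, alpha=0.05):
    """Two-sided unpooled z-test for p_T - p_C (no continuity correction).

    Counts and sample sizes may be non-integer so that effective
    sample sizes can be passed in directly.
    """
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = diff / se
    p_value = 2 * (1 - Z.cdf(abs(z)))
    half_width = Z.inv_cdf(1 - alpha / 2) * se
    return {
        "p_c": p_c,
        "p_t": p_t,
        "diff": diff,
        "rel_lift": diff / p_c,
        "z": z,
        "p_value": p_value,
        "ci": (diff - half_width, diff + half_width),
    }


def design_effect(n, clusters, icc):
    """DE = 1 + (m_bar - 1) * ICC with m_bar = n / clusters."""
    m_bar = n / clusters
    return 1 + (m_bar - 1) * icc


# Totals given in the question.
N_C, X_C = 10_240, 308
N_T, X_T = 10_180, 351
DAYS, ICC = 7, 0.01

naive = diff_test(X_C, N_C, X_T, N_T)

# Shrink each arm by its own design effect; proportions are unchanged.
de_c = design_effect(N_C, DAYS, ICC)
de_t = design_effect(N_T, DAYS, ICC)
adjusted = diff_test(X_C / de_c, N_C / de_c, X_T / de_t, N_T / de_t)

for label, res in (("naive", naive), ("clustered", adjusted)):
    lo, hi = res["ci"]
    print(f"{label}: diff={res['diff'] * 100:.3f} pp, z={res['z']:.3f}, "
          f"p={res['p_value']:.4f}, 95% CI=({lo * 100:.3f}, {hi * 100:.3f}) pp")
print(f"DE_C={de_c:.2f}, DE_T={de_t:.2f}, "
      f"n_eff_C={N_C / de_c:.1f}, n_eff_T={N_T / de_t:.1f}")
```

Under these assumptions the unadjusted test gives an absolute lift of about 0.44 percentage points (roughly 14.6% relative), z of about 1.78, and a two-sided p-value near 0.075. With a design effect of about 15.6 per arm, the effective sample size drops to roughly 656 visits per variant, z falls to about 0.45, and the p-value rises to about 0.65. The adjustment assumes a common ICC across days and arms and equal cluster sizes; with only 7 clusters per arm, a normal approximation is itself optimistic, which is worth stating as a limitation.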
Quick Answer: This question tests statistical inference for A/B testing: estimating and comparing conversion proportions, running a two-sided hypothesis test, adjusting for day-level clustering with an ICC-based design effect, and calculating power and sample size. It combines conceptual understanding with practical application. It is commonly asked to assess whether a candidate can interpret conversion uplift under realistic experimental constraints, account for intra-cluster correlation when estimating effective sample sizes and uncertainty, and reason about experiment duration and robustness checks.
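For the power and robustness parts, a hedged sketch follows. The sample-size formula used is the standard unpooled one, n = (z_{1-alpha/2} + z_{power})^2 · [p1(1 − p1) + p2(1 − p2)] / (p2 − p1)^2. The clustered duration rests on one reading of the question, namely that each day is a cluster of 2,000,000 visits per variant, so the design effect is recomputed at that cluster size; the function names and this interpretation are assumptions. The `cochran_q` helper expects per-day lift estimates and standard errors, which the question does not supply, so it is shown without data.

```python
from math import ceil
from statistics import NormalDist


def n_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Visits per variant for a two-sided unpooled z-test of p2 vs p1."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z_sum ** 2 * var_sum / (p2 - p1) ** 2)


def days_needed(n_srs, visits_per_day, icc):
    """Days so that the effective sample size reaches n_srs.

    Assumes one cluster per day of size visits_per_day, so
    DE = 1 + (visits_per_day - 1) * icc and each day contributes
    visits_per_day / DE effective visits.
    """
    de = 1 + (visits_per_day - 1) * icc
    return ceil(n_srs * de / visits_per_day), de


def cochran_q(diffs, ses):
    """Cochran's Q and I^2 for per-day lift estimates.

    Q is compared against a chi-square distribution with k - 1 degrees
    of freedom (5% critical value of about 12.59 for 7 days).
    """
    weights = [1 / s ** 2 for s in ses]
    pooled = sum(w * d for w, d in zip(weights, diffs)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, diffs))
    df = len(diffs) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2


# 3.00% baseline, +0.30 pp absolute lift, alpha = 0.05, power = 0.80.
n_srs = n_per_variant(0.030, 0.033)
days, de = days_needed(n_srs, visits_per_day=2_000_000, icc=0.01)

print(f"unclustered n per variant: {n_srs:,}")
print(f"design effect at 2,000,000 visits/day: {de:,.2f}")
print(f"implied duration: {days} days "
      f"({days * 2_000_000:,} visits per variant)")
```

Under this reading the unclustered requirement is roughly 53,200 visits per variant, but the design effect at 2,000,000 visits per day is about 20,001, so each day contributes only about 100 effective visits (close to 1/ICC) and the implied duration is on the order of 530 days. An alternative reading reuses the design effect of about 15.6 from the 7-day data, giving about 831,000 visits per variant, which is less than one day of traffic; that reading applies a cluster size the stated traffic does not match. Either way, the gap between the two answers is a useful point to raise: with large daily clusters, adding days matters far more than adding visits per day. For the robustness check, a large Q or I² from `cochran_q`, or a significant variant-by-weekday interaction, would argue for extending the test across full weekly cycles before launching.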