Design Messenger spam experiment with clustering
Company: Meta
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: hard
Interview Round: Technical Screen
Meta Messenger is considering launching a new spam-detection algorithm that routes suspected spam into a separate folder and slightly delays delivery for additional checks. You must design the experiment, decide whether to launch, and specifically assess whether cluster randomization is appropriate.
Answer the following:
1) Primary decision and metrics: Define the primary success metric as "spam reply rate" (probability a recipient replies within 24 hours to a message flagged as spam). Propose at least three guardrail metrics (e.g., delivery latency, false-positive rate on non-spam, user-initiated spam reports) and two secondary metrics (e.g., block rate, conversation retention). Specify precise denominators and attribution windows.
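To make the metric definition concrete, here is a minimal pandas sketch of the primary metric, assuming a hypothetical message log with one row per delivered suspected-spam message (column names are illustrative, not Meta's actual schema):

```python
import pandas as pd

# Hypothetical log: one row per delivered suspected-spam message.
messages = pd.DataFrame({
    "message_id": [1, 2, 3, 4],
    "recipient_id": [10, 11, 10, 12],
    "delivered_at": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 06:00",
        "2024-01-02 00:00", "2024-01-02 12:00",
    ]),
    # Recipient's first reply in the thread; NaT if they never replied.
    "first_reply_at": pd.to_datetime([
        "2024-01-01 03:00", None,
        "2024-01-03 08:00", "2024-01-02 13:30",
    ]),
})

# Numerator: messages with a reply inside the 24h attribution window.
# (NaT comparisons evaluate to False, so never-replied rows count as 0.)
window = pd.Timedelta(hours=24)
replied_in_window = (messages["first_reply_at"] - messages["delivered_at"]) <= window

# Denominator: all delivered suspected-spam messages.
spam_reply_rate = replied_in_window.mean()
print(spam_reply_rate)
```

In this toy data, rows 1 and 4 reply within 24h and rows 2 and 3 do not, so the rate is 0.5. The key interview point is that the denominator (delivered flagged messages) and the window (24h from delivery) are fixed up front.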
2) Unit of randomization: Compare individual-user randomization vs cluster randomization at (a) conversation/thread, (b) recipient-user (ego), and (c) geo (country or data-center switchback). For each, identify interference pathways in messaging networks (e.g., sender in control → recipient in treatment; multi-party threads; new threads formed mid-test) and when SUTVA is most likely violated. State which unit you choose and why.
3) Cluster randomization pitfalls: Explain problems unique to cluster designs: inflated variance from intra-cluster correlation (ICC), unequal/variable cluster sizes, cluster drift (members join/leave threads), and treatment leakage (new threads not bound to a cluster). Give concrete mitigation tactics: cluster locking via stable hashing, intent-to-treat with cluster-level assignment logs, cluster-robust (CR2/CR3) or randomization-inference SEs, and weighting choices (cluster-weighted vs message-weighted) with rationale.
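"Cluster locking via stable hashing" can be sketched as below; the salt string and the 50/50 split are illustrative assumptions, not a specific production system:

```python
import hashlib

def assign_arm(cluster_key: str,
               salt: str = "spam_exp_v1",
               treat_frac: float = 0.5) -> str:
    """Deterministically assign a cluster to an arm via stable hashing.

    The same cluster_key always maps to the same arm for the life of the
    experiment ("cluster locking"), regardless of membership churn.
    The salt is a hypothetical per-experiment namespace that decorrelates
    bucketing across concurrent experiments.
    """
    digest = hashlib.sha256(f"{salt}:{cluster_key}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # ~uniform in [0, 1]
    return "treatment" if bucket < treat_frac else "control"
```

Because assignment is a pure function of the key and salt, users who join a treated thread mid-test inherit the thread's arm, and re-running assignment from logs reproduces the intent-to-treat mapping exactly.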
4) Power and sample size: Suppose the baseline spam reply rate is 2.0% and you want to detect a 10% relative reduction (to 1.8%) at alpha=0.05 and power=0.8 over a 7-day test. You expect 200M suspected-spam messages/day globally. If you cluster by thread with an average of m=3 suspected-spam messages per thread over the 7 days and ICC=0.07 for the 24h-reply outcome: (i) compute the design effect DEFF=1+(m−1)*ICC, (ii) compute the effective sample size versus individual randomization, and (iii) explain how this changes the MDE. If you instead cluster by recipient with m=20 messages per recipient and ICC=0.02, repeat the calculation and recommend a design.
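The arithmetic for (i)–(iii) can be checked with a short sketch using the standard normal-approximation MDE for a difference in proportions (z values rounded; equal cluster sizes assumed, which is a simplification of real thread-size distributions):

```python
import math

def deff(m: float, icc: float) -> float:
    # Design effect for clusters of (average) size m.
    return 1 + (m - 1) * icc

def mde_abs(p: float, n_per_arm: float,
            alpha: float = 0.05, power: float = 0.8) -> float:
    """Two-sided absolute MDE for a difference in proportions,
    normal approximation: (z_{1-a/2} + z_{power}) * sqrt(2 p (1-p) / n)."""
    z_a, z_b = 1.96, 0.8416  # rounded z quantiles for alpha=0.05, power=0.8
    return (z_a + z_b) * math.sqrt(2 * p * (1 - p) / n_per_arm)

n_total = 200e6 * 7      # suspected-spam messages over the 7-day test
n_arm = n_total / 2      # 50/50 split

for label, m, icc in [("thread", 3, 0.07), ("recipient", 20, 0.02)]:
    d = deff(m, icc)           # thread: 1.14, recipient: 1.38
    n_eff = n_arm / d          # effective sample size per arm
    print(label, round(d, 2), f"{n_eff:.2e}", f"{mde_abs(0.02, n_eff):.2e}")
```

The MDE inflates by sqrt(DEFF) relative to individual randomization. At this traffic volume both designs are overpowered for a 0.2pp absolute effect, so the interview answer should hinge on interference control, not raw power.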
5) Analysis plan: Detail the estimator and inference: difference-in-means at the cluster level vs message level with CR2/CR3 SEs or mixed-effects logistic regression (random intercept for cluster), CUPED using a 14-day pre-period, and a pre-registered tie-break for ambiguous threads (e.g., cluster by min(user_id) hash). Specify how you’ll handle multiple exposure types (flag only vs flag+delay) and noncompliance.
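A minimal sketch of the CUPED adjustment at the cluster level, assuming (names are illustrative) `y` holds cluster-level spam reply rates in the 7-day test and `x_pre` the same clusters' rates in the 14-day pre-period:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: y_adj = y - theta * (x_pre - mean(x_pre)).

    theta = cov(y, x_pre) / var(x_pre) is the variance-minimizing
    coefficient; centering x_pre leaves the mean of y unchanged, so the
    difference-in-means estimate is unbiased but lower-variance.
    """
    theta = np.cov(y, x_pre, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())
```

The variance reduction is roughly the squared correlation between outcome and pre-period covariate, which is why a stable 14-day pre-period metric is worth the extra pipeline work.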
6) Launch decision: Given plausible effect sizes (e.g., −8% to −12% relative on spam reply rate) and guardrails not regressing, state the quantitative launch criterion and the minimal ramp plan (e.g., 5%→25%→100%) with geo holdout and a 1% long-term holdback for ongoing monitoring.