Experiment Design: Spam-Detection Algorithm for Messenger
You are evaluating a new spam-detection algorithm that routes suspected spam into a separate folder and slightly delays delivery for additional checks. Design the experiment, decide whether to launch, and explicitly assess whether cluster randomization is appropriate.
Answer the following:
- Primary decision and metrics
  - Define the primary success metric as "spam reply rate": the probability that a recipient replies within 24 hours to a message flagged as spam.
  - Propose at least three guardrail metrics (e.g., delivery latency, false-positive rate on non-spam, user-initiated spam reports) and two secondary metrics (e.g., block rate, conversation retention).
  - For each metric, specify precise denominators and attribution windows (a minimal sketch follows this list).
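To make "precise denominators and attribution windows" concrete, here is a minimal pandas sketch of the primary metric and one guardrail. The dataset and column names (`flagged_messages.parquet`, `sent_at`, `first_reply_at`, `user_reported_spam`) are hypothetical placeholders, not a real schema:

```python
import pandas as pd

# Hypothetical event log: one row per message flagged as suspected spam.
# Assumed columns: message_id, recipient_id, sent_at, first_reply_at
# (NaT if the recipient never replied), user_reported_spam (bool,
# assumed to encode "reported within 7 days of delivery").
flagged = pd.read_parquet("flagged_messages.parquet")

WINDOW = pd.Timedelta(hours=24)

# Primary metric: spam reply rate.
# Denominator: all messages flagged as suspected spam during the test window.
# Numerator: those with a recipient reply within 24h of delivery.
# (NaT deltas compare as False, so never-replied messages count as non-replies.)
replied_24h = (flagged["first_reply_at"] - flagged["sent_at"]) <= WINDOW
spam_reply_rate = replied_24h.mean()

# Guardrail example: user-initiated spam-report rate over the same denominator.
report_rate = flagged["user_reported_spam"].mean()

print(f"spam reply rate (24h): {spam_reply_rate:.4%}")
print(f"user spam-report rate (7d): {report_rate:.4%}")
```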
- Unit of randomization
  - Compare individual delivery-level randomization vs. cluster randomization at:
    a) conversation/thread,
    b) recipient-user (ego), and
    c) geo (country or data-center switchback).
  - For each, identify the interference pathways in messaging networks (e.g., a sender in control messaging a recipient in treatment; multi-party threads; new threads formed mid-test) and state when SUTVA is most likely violated.
  - State which unit you choose and why (a hash-based assignment sketch follows this list).
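Whichever unit is chosen, assignment is typically a deterministic function of a stable key for that unit. A minimal sketch, assuming a hypothetical experiment salt and a 50/50 split; the key format strings are illustrative only:

```python
import hashlib

def assign(unit_key: str, salt: str = "spam-detect-exp-v1",
           treat_pct: int = 50) -> str:
    """Deterministic bucket assignment from a stable hash of the unit key.

    unit_key encodes the chosen randomization unit: a thread id for
    conversation-level clusters, a recipient user id for ego clusters,
    or a country code for a geo design. The salt (hypothetical) isolates
    this experiment's buckets from other concurrent experiments.
    """
    digest = hashlib.sha256(f"{salt}:{unit_key}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treat_pct else "control"

# The same function supports each candidate unit; only the key changes:
print(assign("thread:8f3a2c"))  # conversation/thread cluster
print(assign("user:12345"))     # recipient (ego) cluster
print(assign("geo:BR"))         # country arm of a geo design
```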
- Cluster randomization pitfalls
  - Explain the problems unique to cluster designs: inflated variance from intracluster correlation (ICC), unequal and variable cluster sizes, cluster drift (members joining and leaving threads), and treatment leakage (new threads not bound to any cluster).
  - Give concrete mitigation tactics: cluster locking via stable hashing, intent-to-treat analysis backed by cluster-level assignment logs, cluster-robust (CR2/CR3) or randomization-inference standard errors, and weighting choices (cluster-weighted vs. message-weighted) with a rationale for each (a randomization-inference sketch follows this list).
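As one of the inference options named above, here is a minimal randomization-inference sketch on simulated per-cluster data. The cluster-weighted estimator (each cluster counts once regardless of size) and all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_diff(cluster_means: np.ndarray, treat: np.ndarray) -> float:
    """Cluster-weighted difference in means (each cluster counts once)."""
    return cluster_means[treat].mean() - cluster_means[~treat].mean()

# Simulated stand-in: per-cluster 24h reply rates and their assignments.
n_clusters = 2000
cluster_means = rng.beta(2, 98, size=n_clusters)              # ~2% baseline
treat = rng.permutation(np.repeat([True, False], n_clusters // 2))
cluster_means[treat] *= 0.9                                   # -10% effect

observed = cluster_diff(cluster_means, treat)

# Randomization inference: re-randomize cluster labels many times to
# build the null distribution, then compute a two-sided p-value.
null = np.array([
    cluster_diff(cluster_means, rng.permutation(treat))
    for _ in range(5000)
])
p_value = np.mean(np.abs(null) >= abs(observed))
print(f"observed diff: {observed:.5f}, randomization p-value: {p_value:.4f}")
```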
- Power and sample size
  - Suppose the baseline spam reply rate is 2.0%, the target is to detect a 10% relative reduction (to 1.8%), α = 0.05, power = 0.8, and the test runs for 7 days. You expect 200M suspected-spam messages/day globally.
  - If clustering by thread, with an average of m = 3 suspected-spam messages per thread over 7 days and ICC = 0.07 for the 24h-reply outcome: (i) compute the design effect DEFF = 1 + (m − 1)·ICC, (ii) compute the effective sample size versus individual randomization, and (iii) explain how this changes the minimum detectable effect (MDE).
  - If instead clustering by recipient, with m = 20 messages per recipient and ICC = 0.02, repeat the calculation and recommend a design (a worked calculation follows this list).
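A worked version of both calculations, using the standard two-proportion z-test approximation and only the Python standard library. The effective sample size under clustering is n/DEFF, so the per-arm requirement inflates by DEFF and the MDE by roughly √DEFF:

```python
from statistics import NormalDist

def n_per_arm(p1: float, p2: float,
              alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate per-arm sample size for a two-proportion z-test."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return z**2 * var / (p1 - p2) ** 2

base = n_per_arm(0.020, 0.018)   # ≈ 73,000 messages per arm, unclustered

for label, m, icc in [("thread", 3, 0.07), ("recipient", 20, 0.02)]:
    deff = 1 + (m - 1) * icc     # thread: 1.14, recipient: 1.38
    print(f"{label}: DEFF = {deff:.2f}, "
          f"n per arm ≈ {base * deff:,.0f} messages, "
          f"MDE inflation ≈ {deff**0.5:.2f}x")

# With ~1.4B suspected-spam messages over 7 days, either design is
# overwhelmingly powered at full traffic; DEFF instead governs how
# quickly an early read (e.g., a small ramp) reaches significance.
```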
- Analysis plan
  - Detail the estimator and inference: difference-in-means at the cluster level vs. the message level with CR2/CR3 SEs, or mixed-effects logistic regression (random intercept for cluster); CUPED using a 14-day pre-period (sketched below); and a pre-registered tie-break for ambiguous threads (e.g., cluster by the hash of min(user_id)).
  - Specify how you will handle multiple exposure types (flag only vs. flag + delay) and noncompliance.
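A minimal sketch of the CUPED step, assuming the covariate is the same metric computed per cluster over the 14-day pre-period; the simulated data is purely illustrative:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: residualize the outcome on a pre-period covariate.

    y     -- outcome in the test period (e.g., per-cluster 24h reply rate)
    x_pre -- same metric for the same clusters over the 14-day pre-period
    theta is the OLS slope of y on x_pre; the adjustment leaves the
    treatment-effect estimate unbiased while shrinking variance by
    roughly corr(y, x_pre) squared.
    """
    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# Usage on simulated data: estimate theta on pooled arms, adjust, then
# take the difference in adjusted means between treatment and control.
rng = np.random.default_rng(1)
x_pre = rng.beta(2, 98, size=10_000)
y = 0.8 * x_pre + rng.normal(0, 0.005, size=10_000)
y_adj = cuped_adjust(y, x_pre)
print(f"variance reduction: {1 - y_adj.var() / y.var():.1%}")
```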
- Launch decision
  - Given plausible effect sizes (e.g., −8% to −12% relative on spam reply rate) and guardrails not regressing, state the quantitative launch criterion and the minimal ramp plan (e.g., 5% → 25% → 100%) with a geo holdout and a 1% long-term holdback for ongoing monitoring (a decision-rule sketch follows).
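One way to pre-register the criterion is as an executable rule. The thresholds here (the full 95% CI below a 5% relative reduction, and a +1% relative non-inferiority margin on each guardrail) are illustrative assumptions, not prescribed values:

```python
def should_launch(effect_ci: tuple[float, float],
                  guardrail_cis: dict[str, tuple[float, float]],
                  min_effect: float = -0.05,
                  guardrail_margin: float = 0.01) -> bool:
    """Pre-registered launch rule (illustrative thresholds only).

    Launch iff the 95% CI for the relative change in spam reply rate
    lies entirely below min_effect (at least a 5% relative reduction)
    AND every guardrail's CI upper bound stays within its relative
    non-inferiority margin.
    """
    _, effect_hi = effect_ci
    if effect_hi >= min_effect:
        return False
    return all(hi <= guardrail_margin for _, hi in guardrail_cis.values())

# Example read: -10% point estimate with CI (-13%, -7%); latency and
# false-positive guardrails within the +1% relative margin -> launch.
print(should_launch((-0.13, -0.07),
                    {"p95_latency": (-0.002, 0.004),
                     "false_positive_rate": (-0.010, 0.006)}))
```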