Cluster Randomized Experiments and Network Interference

What's being tested

Interviewers are probing whether you can design a credible causal experiment when the standard user-level A/B test assumption breaks: one user’s treatment can affect another user’s outcomes. Meta cares because many products are inherently social—Messenger, Feed, groups, recommendations, spam enforcement, fake-account removal—and naive randomization can underestimate effects, contaminate controls, or harm user experience. A strong Data Scientist should identify network interference, define the right estimand, choose a defensible cluster randomization strategy, and explain tradeoffs in power, bias, and operational risk. The interviewer is not looking for graph-engineering implementation details; they are testing your statistical reasoning, metric design, and ability to make a launch recommendation under imperfect isolation.

Core knowledge

SUTVA—the Stable Unit Treatment Value Assumption—requires no hidden treatment versions and no interference between units. In social products, SUTVA often fails: if Alice receives a spam-filter change, Bob’s messages_received, reply_rate, or spam_reports may change even if Bob is in control.
Interference means a unit’s outcome depends on other units’ treatment assignments: $Y_i = Y_i(Z_i, Z_{-i})$. Common forms include direct user-to-user messaging, creator-viewer relationships, group interactions, marketplace buyer-seller effects, and adversarial ecosystems like spam or fake accounts.
Cluster randomization assigns treatment at a group level—communities, ego networks, conversation threads, households, geographic regions, schools, or graph partitions—so most interaction edges stay within the same assignment. The goal is not perfect isolation; it is reducing cross-arm exposure enough that the estimand is interpretable.
Graph partitioning is the usual mental model: create clusters that maximize within-cluster edges and minimize between-cluster edges, often using algorithms like Louvain community detection, METIS-style partitioning, connected components, or business-defined clusters such as group_id or conversation_id. The DS should focus on whether the resulting clusters are balanced, stable, interpretable, and low-contamination.
Contamination rate is a key diagnostic: what fraction of relevant exposures cross treatment arms, for example exposures that users in treated clusters receive from control clusters, and vice versa? A simple edge-weighted version is $\text{contamination} = \frac{\sum_{(i,j): Z_i \neq Z_j} w_{ij}}{\sum_{(i,j)} w_{ij}}$ where the sums run over interacting pairs $(i,j)$, $Z_i$ is the arm assigned to user $i$'s cluster, and $w_{ij}$ might be message volume, impressions, replies, or historical interaction strength.
Estimand choice should be explicit. You may estimate the intention-to-treat effect at the cluster assignment level, the effect on highly exposed users, a spillover effect on neighbors, or a global ecosystem effect. “Average treatment effect on users” is often too vague when users have different exposure to treated peers.
Exposure mapping translates complex networks into analyzable conditions, such as “user is treated,” “at least 50% of inbound messages come from treated senders,” or “has 2+ treated close friends.” This enables comparisons like treated-high-exposure vs control-low-exposure, but thresholds must be pre-specified to avoid fishing.
Unit of analysis should usually match the unit of randomization or account for clustering. If you randomize by cluster but analyze user-level rows as if they were independent, standard errors will typically be too small. Use cluster-level aggregation, cluster-robust standard errors, randomization inference, or hierarchical modeling depending on cluster count and metric structure.
Power is typically worse than in user-level A/B tests because effective sample size is closer to the number of clusters than the number of users. The design effect is approximately $DE = 1 + (m - 1)\rho$, where $m$ is average cluster size and $\rho$ is intra-cluster correlation, so $n$ users carry roughly the information of $n / DE$ independent users. Large clusters and high correlation can make an experiment underpowered even with millions of users.
Cluster balance matters because social clusters can be highly skewed. You should check pre-period balance on DAU, messages_sent, spam_reports, geography, platform, tenure, and baseline outcome metrics. Use stratified or matched-pair randomization when clusters vary drastically in size or activity.
Metric selection should separate direct product goals, guardrails, and ecosystem effects. For a Messenger spam experiment, primary metrics might include spam_message_rate, user_report_rate, message_send_success, and reply_rate; guardrails might include false_positive_rate, blocked_legitimate_messages, retention, and sender/recipient experience split by segment.
Pre-launch analysis should quantify interference risk before the test: inspect the interaction graph, estimate cross-cluster edge share under candidate cluster definitions, simulate randomizations, compute minimum detectable effect, and identify sensitive segments. If contamination is too high, consider switchback designs, geo experiments, holdout networks, or staged rollouts.
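As a rough illustration of the pre-launch check, the sketch below computes the edge-weighted contamination rate for one candidate cluster definition and averages it over simulated 50/50 cluster randomizations. The function names, the `(user_i, user_j, weight)` edge format, and the toy graph are assumptions made for illustration, not part of any particular experimentation platform.

```python
import random

def contamination_rate(edges, cluster_of, assignment):
    """Share of interaction weight on edges whose endpoints are in different arms.

    edges: iterable of (user_i, user_j, weight)   (hypothetical format)
    cluster_of: dict mapping user -> cluster id
    assignment: dict mapping cluster id -> 0 (control) or 1 (treatment)
    """
    cross = 0.0
    total = 0.0
    for i, j, w in edges:
        total += w
        if assignment[cluster_of[i]] != assignment[cluster_of[j]]:
            cross += w
    return cross / total if total else 0.0

def simulate_contamination(edges, cluster_of, n_sims=1000, seed=0):
    """Mean and worst-case contamination over random half/half cluster assignments."""
    rng = random.Random(seed)
    clusters = sorted(set(cluster_of.values()))
    rates = []
    for _ in range(n_sims):
        shuffled = clusters[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        # The first half of the shuffled clusters is treated, the rest is control.
        assignment = {c: int(k < half) for k, c in enumerate(shuffled)}
        rates.append(contamination_rate(edges, cluster_of, assignment))
    return sum(rates) / len(rates), max(rates)

# Toy messaging graph (illustrative numbers only).
toy_edges = [("a", "b", 10), ("b", "c", 5), ("c", "d", 10), ("a", "d", 1)]
by_community = {"a": "c1", "b": "c1", "c": "c2", "d": "c2"}
by_user = {u: u for u in "abcd"}  # user-level randomization as a baseline

print(simulate_contamination(toy_edges, by_community, n_sims=200))
print(simulate_contamination(toy_edges, by_user, n_sims=200))
```

Running the same simulation for several candidate cluster definitions gives a like-for-like comparison of cross-arm edge share before any users are exposed.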

Worked example

For “Design Messenger spam experiment with clustering”, a strong candidate would start by clarifying the treatment: “Are we changing the spam classifier threshold, sender enforcement, recipient warnings, or message delivery ranking?” They would then ask whose outcome matters—senders, recipients, conversation threads, or the broader messaging ecosystem—and declare that user-level randomization is risky because a treated sender can message a control recipient. The answer should be organized around four pillars: define the causal estimand, construct clusters from the messaging graph, choose primary and guardrail metrics, and plan inference/power under clustered assignment. For clustering, they might propose building clusters from recent high-weight messaging edges, then randomizing clusters after stratifying by size, country, and baseline spam rate. The primary estimand could be the intention-to-treat effect of enabling the new spam policy for all users in treated clusters on recipient-level spam_report_rate and legitimate message_delivery_rate. A specific tradeoff to flag is that larger graph clusters reduce cross-arm contamination but reduce the number of independent experimental units, lowering power and increasing sensitivity to outlier clusters. They should also mention analysis at the cluster level or with cluster-robust uncertainty, not naive per-message standard errors. A crisp close would be: “If I had more time, I’d run pre-period simulations to compare cluster definitions, estimate contamination, and decide whether this is feasible as an experiment or should start as a limited holdout plus observational spillover analysis.”
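The point about cluster-level analysis can be sketched in a few lines: aggregate outcomes to one value per cluster, take the difference in cluster means, and get a p-value by re-randomizing cluster labels. This is a minimal sketch under stated assumptions: every cluster gets equal weight (a cluster-average estimand), the randomization was not stratified (a stratified design would permute within strata), and the `(cluster_id, outcome)` row format and function names are hypothetical.

```python
import random
from statistics import mean

def cluster_means(rows):
    """rows: iterable of (cluster_id, outcome). Returns dict cluster_id -> mean outcome."""
    sums, counts = {}, {}
    for c, y in rows:
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def diff_in_means(means, assignment):
    """Treated-minus-control difference of cluster-level means."""
    treated = [means[c] for c in means if assignment[c] == 1]
    control = [means[c] for c in means if assignment[c] == 0]
    return mean(treated) - mean(control)

def randomization_p_value(means, assignment, n_perm=5000, seed=0):
    """Two-sided Monte Carlo p-value from shuffling cluster labels.

    The number of treated clusters is held fixed in every shuffle.
    """
    rng = random.Random(seed)
    clusters = sorted(means)
    labels = [assignment[c] for c in clusters]
    observed = diff_in_means(means, assignment)
    extreme = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        shuffled = dict(zip(clusters, perm))
        # Small tolerance so ties with the observed value count as extreme.
        if abs(diff_in_means(means, shuffled)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n_perm
```

With only a handful of clusters the smallest achievable p-value is large, which is one concrete way the power cost of clustering shows up.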

A second angle

For “Design experiment for fake accounts impact”, the same principles apply, but the treatment and interference path are broader. Removing or demoting suspected fake accounts affects real users who receive friend requests, comments, messages, follows, ad engagement, or content impressions from those accounts. The unit of clustering might be based on interaction neighborhoods around suspicious accounts, not just ordinary user communities, and the estimand may include spillover benefits to real users rather than outcomes for the treated accounts themselves. Metrics would include fake_account_prevalence, friend_request_accept_rate, content_integrity_reports, real_user_retention, and false-positive harm to legitimate accounts. The main constraint is ethical and operational: you may not want to knowingly leave harmful fake accounts active for long, so the design might use short exposure windows, risk-tiered eligibility, or phased rollout with strong guardrails.

Common pitfalls

Pitfall: Treating a networked product like a standard user-level A/B test.

The tempting answer is “randomize users 50/50 and compare spam_reports.” That ignores interference: treated senders can affect control recipients, and control senders can dilute treated recipients’ experience. A better answer explicitly states why SUTVA fails, then proposes cluster-level assignment or an exposure-based design.
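An exposure-based design depends on a pre-specified exposure mapping. The sketch below shows one possible mapping, using the "at least 50% of inbound messages come from treated senders" rule as the threshold. The `(sender, recipient, count)` message format, the function names, and the choice to treat users with no inbound messages as low exposure are all illustrative assumptions.

```python
from collections import defaultdict

def inbound_treated_share(messages, treated):
    """Share of each recipient's inbound message volume that comes from treated senders.

    messages: iterable of (sender, recipient, count)   (hypothetical format)
    treated: set of treated user ids
    """
    total = defaultdict(float)
    from_treated = defaultdict(float)
    for sender, recipient, count in messages:
        total[recipient] += count
        if sender in treated:
            from_treated[recipient] += count
    return {r: from_treated[r] / total[r] for r in total if total[r] > 0}

def exposure_condition(user, treated, share, threshold=0.5):
    """Label a user by own assignment and by exposure to treated senders.

    Users with no inbound messages default to low exposure (an assumption).
    """
    own = "treated" if user in treated else "control"
    level = "high" if share.get(user, 0.0) >= threshold else "low"
    return f"{own}-{level}-exposure"

# Toy data: only user "a" is treated.
toy_messages = [("a", "b", 8), ("c", "b", 2), ("a", "d", 1), ("c", "d", 3)]
toy_share = inbound_treated_share(toy_messages, {"a"})
for user in ["a", "b", "d"]:
    print(user, exposure_condition(user, {"a"}, toy_share))
```

Fixing the threshold in the analysis plan before the data arrive is what keeps this comparison from turning into a search over cut points.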

Pitfall: Optimizing only for contamination and forgetting power.

Candidates often say “make clusters as large as possible so there is no spillover.” That can leave you with too few independent units, poor balance, and an unusable confidence interval. The stronger framing is a bias-variance tradeoff: reduce cross-arm edges while preserving enough clusters and pre-period balance for credible inference.
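The variance side of this tradeoff can be made concrete with the design effect $DE = 1 + (m - 1)\rho$ from Core knowledge. The sketch below is back-of-envelope arithmetic that assumes roughly equal cluster sizes. The user count, cluster sizes, and intra-cluster correlation are made-up illustrative values, not measurements.

```python
def design_effect(avg_cluster_size, icc):
    """Variance inflation from cluster assignment: DE = 1 + (m - 1) * rho."""
    return 1.0 + (avg_cluster_size - 1.0) * icc

def effective_sample_size(n_users, avg_cluster_size, icc):
    """Approximate number of independent users that n_users are 'worth'."""
    return n_users / design_effect(avg_cluster_size, icc)

n_users = 1_000_000  # illustrative
icc = 0.02           # illustrative intra-cluster correlation

for m in (50, 5000):
    de = design_effect(m, icc)
    ess = effective_sample_size(n_users, m, icc)
    print(f"m={m}: DE={de:.2f}, effective n={ess:,.0f}")
# m=50 gives DE of about 1.98 (effective n near 505,000);
# m=5000 gives DE of about 100.98 (effective n near 9,900).
```

Under these assumed numbers, growing clusters from 50 to 5,000 users cuts the effective sample size by roughly a factor of 50, which is the cost to weigh against the reduction in cross-arm edges.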

Pitfall: Describing clustering mechanics without tying them to the decision.

It is not enough to name Louvain or graph partitioning. The interviewer wants to know what metric you are trying to move, what causal effect you can estimate, how you will compute uncertainty, and what result would justify launch. Keep connecting design choices back to the product decision and the estimand.

Connections

Interviewers may pivot from this topic into difference-in-differences, synthetic controls, switchback experiments, geo experiments, power analysis, or variance reduction with pre-period covariates. They may also ask how to diagnose heterogeneous effects across countries, tenure, high-degree users, or abuse-risk segments after the clustered test.
