Network Interference And Cluster Randomization

What's being tested

Interviewers are probing whether you can design and analyze an experiment when SUTVA is violated, specifically its no-interference component: the assumption that one user’s outcome is unaffected by another user’s treatment. At Meta, many products are networked: `Messenger`, `WhatsApp`, `Facebook Groups`, `Instagram` sharing, feed ranking, invites, notifications, and recommendations can all create spillovers across friends, groups, creators, or conversation threads. A strong Data Scientist must define the causal estimand, choose an appropriate randomization unit, reduce contamination, and analyze clustered or dependent outcomes without overstating precision. The interviewer is usually less interested in a perfect graph algorithm and more interested in whether you can reason clearly about tradeoffs between bias, variance, power, interpretability, and product risk.

Core knowledge

Network interference occurs when treatment assigned to unit $i$ affects outcome $Y_j$ for another unit $j$. A naive user-level A/B test can dilute or bias effects if treated users interact with control users through messages, calls, comments, shares, invites, or recommendations.
SUTVA violations come in two forms: hidden versions of treatment and interference between units. In social products, the second is common: a treated user may invite control friends, change group activity, increase content supply, or alter ranking feedback loops.
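Estimand definition is the first move. Distinguish direct effects, spillover effects, and total effects. The total effect compares everyone treated with everyone in control:
$\text{Total effect} = E[Y_i(\mathbf{1}) - Y_i(\mathbf{0})]$
where $\mathbf{1}$ and $\mathbf{0}$ are the all-treated and all-control assignment vectors, while a direct effect compares $Y_i(Z_i=1, Z_{-i})$ with $Y_i(Z_i=0, Z_{-i})$, holding the other units’ assignments $Z_{-i}$ fixed at a specified exposure environment.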
Exposure models translate a graph into treatment conditions. Examples: user is treated if assigned treatment; user is “network-exposed” if at least $k$ friends are treated; group is exposed if $\geq 50\%$ of active members are treated. The exposure definition should map to the product mechanism, not just graph convenience.
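The exposure definitions above can be made concrete with a small sketch. It labels each user by their own assignment and by whether at least $k$ friends are treated. The function name `exposure_conditions`, the toy friend graph, and the threshold `k=2` are illustrative assumptions, not a prescribed exposure model.

```python
from collections import defaultdict

def exposure_conditions(edges, assignment, k=2):
    """Label each user by own arm and by whether >= k friends are treated.

    edges: iterable of undirected (user, user) friend pairs (hypothetical input).
    assignment: {user: 1 for treatment, 0 for control}.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    labels = {}
    for user, treated in assignment.items():
        # Count treated friends; friends outside the experiment count as 0.
        treated_friends = sum(assignment.get(f, 0) for f in neighbors[user])
        labels[user] = (
            "treated" if treated else "control",
            "network_exposed" if treated_friends >= k else "not_exposed",
        )
    return labels

# Toy graph: user c is in control but has two treated friends (a and b).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
assignment = {"a": 1, "b": 1, "c": 0, "d": 0}
labels = exposure_conditions(edges, assignment, k=2)
```

Here `c` lands in the (control, network-exposed) cell, which is exactly the kind of unit a naive user-level comparison would treat as clean control.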
Cluster randomization assigns treatment at the cluster level, such as household, friend-community, creator-audience community, `Messenger` thread, `WhatsApp` group, school, workplace, or geography. It reduces cross-arm contamination but usually increases variance because observations within clusters are correlated.
Graph clustering tries to maximize within-cluster edges and minimize cross-cluster edges. Common approaches include Louvain modularity, METIS graph partitioning, label propagation, and connected components for small closed networks. DS candidates should discuss quality metrics like edge cut rate, cluster size balance, and expected contamination.
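As a rough illustration of the quality metrics just mentioned, the sketch below computes an edge cut rate and a simple size-balance ratio for a given clustering. It assumes the clustering already exists (for example, from Louvain or METIS run elsewhere); `cluster_quality` and the toy data are hypothetical names and values.

```python
from collections import Counter

def cluster_quality(edges, cluster_of):
    """Return edge cut rate and max-to-mean cluster size for a clustering.

    edges: list of undirected (user, user) pairs.
    cluster_of: {user: cluster_id}, assumed to cover every user in edges.
    """
    # An edge is "cut" when its endpoints fall in different clusters.
    cut = sum(1 for u, v in edges if cluster_of[u] != cluster_of[v])
    sizes = Counter(cluster_of.values())
    mean_size = sum(sizes.values()) / len(sizes)
    return {
        "edge_cut_rate": cut / len(edges),
        "max_to_mean_size": max(sizes.values()) / mean_size,
    }

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
quality = cluster_quality(edges, cluster_of)
```

Passing arm labels instead of cluster ids in `cluster_of` gives the share of edges that cross treatment arms, which is one simple proxy for expected contamination.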
Intracluster correlation drives power loss. With average cluster size $m$ and intracluster correlation $\rho$, the approximate design effect is:
$DE = 1 + (m - 1)\rho$
so effective sample size is roughly $n / DE$, where $n$ is the raw user count. The formula assumes roughly equal cluster sizes; variation in cluster size inflates the design effect further, so large, uneven clusters can severely reduce power even when raw user count is high.
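A quick arithmetic sketch shows how fast the design effect erodes sample size. The inputs (one million users, average cluster size 50, $\rho = 0.05$) are made-up illustrative numbers, and the function names are hypothetical.

```python
def design_effect(mean_cluster_size, icc):
    """Approximate design effect DE = 1 + (m - 1) * rho for equal-sized clusters."""
    return 1 + (mean_cluster_size - 1) * icc

def effective_sample_size(n_users, mean_cluster_size, icc):
    """Raw user count deflated by the design effect."""
    return n_users / design_effect(mean_cluster_size, icc)

# Illustrative inputs only: 1,000,000 users, clusters of about 50, ICC of 0.05.
de = design_effect(50, 0.05)                      # about 3.45
ess = effective_sample_size(1_000_000, 50, 0.05)  # roughly 290,000 users
```

Even a modest intracluster correlation cuts the effective sample to under a third of the raw count in this example, which is why power should be computed from clusters and $\rho$ rather than from user totals.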
Cluster size imbalance matters. A few massive clusters can dominate the estimate or force awkward assignment. Practical mitigations include stratifying by cluster size, trimming or separately handling giant clusters, weighting clusters carefully, and reporting both user-weighted and cluster-weighted estimates when interpretation differs.
Contamination measurement should be part of the design. Track metrics like percentage of user interactions crossing treatment arms, treated-control message edges, group calls with mixed assignment, recommendation impressions across arms, or share/invite flows. High contamination does not automatically invalidate a test, but it changes the estimand.
Analysis methods should match the assignment. For cluster-randomized tests, use cluster-level aggregation, cluster-robust standard errors, randomization inference, or hierarchical modeling. Treating millions of users as independent after cluster assignment is a classic way to produce fake significance.
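One way to keep the analysis aligned with the assignment is to aggregate to cluster means and then run a permutation-style randomization test over clusters. The sketch below is a minimal stdlib version under stated assumptions: the helper names and toy data are hypothetical, the estimate is cluster-weighted (each cluster counts once), and the test re-shuffles arm labels across clusters without respecting any strata.

```python
import random
from statistics import mean

def cluster_means(rows):
    """rows: iterable of (cluster_id, outcome). Returns {cluster_id: mean outcome}."""
    by_cluster = {}
    for cid, y in rows:
        by_cluster.setdefault(cid, []).append(y)
    return {cid: mean(ys) for cid, ys in by_cluster.items()}

def diff_in_means(cmeans, arm_of):
    """Cluster-weighted difference: mean of treated cluster means minus control."""
    treated = [m for c, m in cmeans.items() if arm_of[c] == 1]
    control = [m for c, m in cmeans.items() if arm_of[c] == 0]
    return mean(treated) - mean(control)

def randomization_p_value(cmeans, arm_of, n_perm=2000, seed=0):
    """Two-sided Monte Carlo p-value from re-shuffling arm labels across clusters."""
    rng = random.Random(seed)
    observed = diff_in_means(cmeans, arm_of)
    clusters = list(arm_of)
    arms = [arm_of[c] for c in clusters]
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(arms)  # keeps the number of treated clusters fixed
        permuted = dict(zip(clusters, arms))
        if abs(diff_in_means(cmeans, permuted)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n_perm

# Toy user-level rows: (cluster_id, outcome). Clusters g1-g3 treated, g4-g6 control.
rows = [("g1", 4.0), ("g1", 6.0), ("g2", 6.0), ("g3", 7.0), ("g3", 7.0), ("g3", 7.0),
        ("g4", 1.0), ("g5", 2.0), ("g5", 2.0), ("g6", 3.0)]
arm_of = {"g1": 1, "g2": 1, "g3": 1, "g4": 0, "g5": 0, "g6": 0}
observed, p_value = randomization_p_value(cluster_means(rows), arm_of)
```

With only six clusters the p-value cannot be small no matter how many user rows sit underneath, which is the point: the number of independent units is the number of clusters.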
Stratified randomization improves balance. Randomize clusters within strata defined by pre-period activity, geography, platform, cluster size, baseline `DAU`, call volume, or creator category. This is especially useful when the number of clusters is limited or outcome variance differs sharply across segments.
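A minimal sketch of stratified cluster assignment follows. It assumes each cluster has already been mapped to a stratum label (for example, a size bucket); `stratified_assign` and the labels are hypothetical, and odd-sized strata send their leftover cluster to a randomly chosen arm.

```python
import random
from collections import defaultdict

def stratified_assign(cluster_strata, seed=0):
    """Randomize clusters to arms within each stratum.

    cluster_strata: {cluster_id: stratum_label}.
    Returns {cluster_id: 1 for treatment, 0 for control}.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for cid, stratum in cluster_strata.items():
        by_stratum[stratum].append(cid)

    assignment = {}
    for stratum in sorted(by_stratum):
        members = sorted(by_stratum[stratum])  # fixed order so the seed is reproducible
        rng.shuffle(members)
        half = len(members) // 2
        extra_arm = rng.randint(0, 1)  # arm for the leftover cluster in odd strata
        for i, cid in enumerate(members):
            if i < half:
                assignment[cid] = 1
            elif i < 2 * half:
                assignment[cid] = 0
            else:
                assignment[cid] = extra_arm
    return assignment

cluster_strata = {"g1": "small", "g2": "small", "g3": "small", "g4": "small",
                  "g5": "large", "g6": "large"}
assignment = stratified_assign(cluster_strata, seed=7)
```

Each stratum ends up split evenly between arms, so a handful of large clusters cannot all fall into one arm by chance.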
Fallback designs include ego-cluster randomization, geo experiments, switchback experiments, marketplace-level holdouts, or two-stage randomized designs. The right choice depends on whether interference travels through friend edges, groups, supply-demand matching, time, geography, or ranking feedback loops.

Worked example

For the prompt “Design experiment for Group Calls with interference,” a strong candidate would start by clarifying the product change: “Is this a feature that changes call initiation, call quality, notification ranking, or participant experience?” They would also ask whether treatment must be consistent for everyone in a call, because mixed treatment inside a single group call can create both user confusion and measurement contamination.

The answer should be organized around four pillars: define the estimand, choose the randomization unit, select metrics, and specify the analysis plan. For the estimand, they might say: “I care about the total effect of launching this feature to a communication community, not just the direct effect on isolated treated users, because calls require multiple participants.” For randomization, they could propose cluster assignment at the `Messenger` group-thread or communication-community level, depending on whether calls mostly happen in persistent groups or across overlapping friend sets.

For metrics, the primary metric could be calls per active group, successful call joins, or call minutes per eligible user, with guardrails such as call drop rate, blocked/muted users, notification opt-outs, and negative feedback. The candidate should explicitly flag the tradeoff: randomizing at the group level reduces within-call contamination but may miss spillovers when users belong to many groups; randomizing at a broader graph-community level reduces spillovers further but costs power and may create uneven clusters.

For analysis, they would aggregate outcomes at the assigned cluster level or use cluster-robust inference, stratifying randomization by baseline call volume and cluster size. They should close by saying: “If I had more time, I’d validate the exposure model using pre-period call graphs, estimate the expected cross-arm edge rate, and run a power calculation using intracluster correlation rather than raw user count.”

A second angle

For the prompt “Implement Clustered Sampling to Mitigate Network Effects in Testing,” the same ideas apply, but the interviewer is likely probing the operational logic of choosing clusters rather than the product-specific metrics. The candidate should describe how they would construct clusters from an interaction graph, evaluate edge cuts, handle huge connected components, and balance treatment/control arms on pre-period outcomes. The key framing shift is from “what metric should this feature move?” to “what sampling and assignment scheme gives an interpretable causal estimate under spillovers?” A strong answer also admits that clustering is imperfect: the goal is not to eliminate all interference, but to reduce it enough that the remaining bias is understood, measured, and disclosed. The analysis still needs cluster-aware standard errors or randomization inference.
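For the cluster-construction step, the simplest starting point is connected components, with oversized components set aside for separate handling. The sketch below assumes a plain edge list and a size cap chosen by the analyst; `connected_components`, `split_by_size`, and `max_size=3` are illustrative, and in practice a giant component would typically be passed to a partitioning method rather than dropped.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return connected components of an undirected graph as sorted member lists."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search from each unvisited node.
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            node = queue.popleft()
            members.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(members))
    return components

def split_by_size(components, max_size):
    """Separate components usable as clusters from giants needing further partitioning."""
    usable = [c for c in components if len(c) <= max_size]
    giants = [c for c in components if len(c) > max_size]
    return usable, giants

edges = [("a", "b"), ("b", "c"), ("d", "e"),
         ("f", "g"), ("g", "h"), ("h", "i"), ("i", "f")]
usable, giants = split_by_size(connected_components(edges), max_size=3)
```

Components have zero cut edges by construction, so the tradeoff shows up entirely in cluster size: anything flagged as a giant has to be split, which reintroduces cross-cluster edges that then need to be measured.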

Common pitfalls

Pitfall: “Just randomize users and add friend count as a covariate.”

This is the tempting analytical mistake. Covariate adjustment can improve precision, but it does not solve interference if control outcomes are changed by treated neighbors. A better answer defines exposure conditions and chooses a randomization unit that aligns with the mechanism of spillover.

Pitfall: Talking only about graph algorithms without naming the causal estimand.

Louvain, `METIS`, and connected components are useful tools, but the interviewer is evaluating causal reasoning, not graph partitioning trivia. Lead with the business and causal question: direct effect, spillover effect, or total launch effect; then explain how clustering supports that estimand.

Pitfall: Ignoring power and variance.

Cluster randomization often sounds like the “safe” answer, but it can destroy effective sample size when clusters are large or highly correlated. A strong candidate mentions design effect, number of independent clusters, stratification, and the risk that a cluster-level test may be underpowered even with millions of users.

Connections

Interviewers may pivot from this topic into causal inference, especially SUTVA, potential outcomes, and spillover estimands. They may also ask about variance estimation, power analysis, metric design, marketplace experiments, geo experiments, or ranking/recommender evaluation under feedback loops. For Meta-style product analytics, expect follow-ups on guardrail metrics, heterogeneous treatment effects, and launch decisions under imperfect experimental evidence.
