Causal Inference, Difference-In-Differences, And Cannibalization

What's being tested

Meta is testing whether you can turn an ambiguous product or marketing question into a defensible causal estimand, not just describe correlations in `DAU`, clicks, conversions, or revenue. The interviewer wants to see if you can choose between randomization, difference-in-differences, matched-market tests, synthetic control, and other quasi-experimental designs based on what is feasible and what assumptions are credible. For a Data Scientist, the core skill is knowing when growth in one source is truly incremental versus shifted from another source, and how to quantify uncertainty under messy product realities like interference, staggered rollout, selection, and seasonality. Meta cares because product launches, ranking changes, ads, notifications, and creator incentives often affect ecosystems where users, advertisers, and content sources interact.

Core knowledge

Causal estimand comes before method choice. Define whether you need `ATE`, `ATT`, incremental conversions, net revenue, retained users, or substitution across channels. For cannibalization, estimate both source-level lift and total-system lift: $\Delta \text{Total} \approx \Delta A + \Delta B + \Delta C$.
Cannibalization means a treatment increases one metric source while decreasing another, so the aggregate gain is smaller than the source-level gain and can be flat or even negative. A strong answer decomposes Total Engagement into components like Feed, Reels, Stories, Search, or Notifications, then tests whether source gains are offset elsewhere.
Difference-in-differences compares treated and control units before and after treatment:
$\hat{\tau}_{DiD}=(\bar{Y}_{T,post}-\bar{Y}_{T,pre})-(\bar{Y}_{C,post}-\bar{Y}_{C,pre})$
It identifies the average treatment effect on the treated (`ATT`) if parallel trends would have held absent treatment and units did not change behavior in anticipation of it.
Parallel trends diagnostics should use multiple pre-periods, not just one baseline. Plot event-study coefficients, test pre-treatment leads, and compare seasonality. A non-significant pre-trend test is not proof; visual and business plausibility matter, especially around holidays, launches, or news cycles.
Two-way fixed effects can fail under staggered adoption when treatment effects vary over time or across cohorts, because already-treated units implicitly serve as controls for later-treated ones. Avoid blindly running $Y_{it}=\alpha_i+\gamma_t+\beta D_{it}+\epsilon_{it}$; consider Callaway-Sant’Anna, Sun-Abraham, or cohort-specific event studies for cleaner group-time effects.
Geo experiments or matched-market lift tests are useful when user-level randomization is impossible, especially for brand ads, shuttles, local launches, or market-level spend changes. With only ~20–200 geo clusters, use matched pairs, pre-period outcome correlation, cluster-robust errors, and randomization inference.
Synthetic control constructs a weighted combination of untreated units to match the treated unit pre-period. It is strongest with one or few treated geos and many donor markets; avoid donors that may be exposed through spillovers, shared media markets, or national campaigns.
Interference violates the stable unit treatment value assumption when one user’s treatment affects another user’s outcome. At Meta, social graph effects are common: notifications, shares, comments, marketplace liquidity, and ad auctions can spill across treatment and control users.
Cluster randomization handles interference by randomizing groups such as geos, schools, workplaces, advertiser accounts, or graph clusters. The tradeoff is power: the effective sample size is driven by the number of clusters rather than the number of users, so variance is often much larger than in a user-level A/B test with millions of users.
Exposure mapping clarifies what “treated” means under spillovers. Instead of binary assignment only, model outcomes by own treatment and neighbor exposure, e.g. $Y_i=f(Z_i,\frac{1}{d_i}\sum_{j \in N(i)} Z_j)$, where $Z_i$ is user $i$’s own assignment, $N(i)$ is the set of their neighbors, and $d_i=|N(i)|$. This helps separate direct effects from network effects.
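
As a rough illustration of the exposure term in that formula, the sketch below computes each user's treated-neighbor fraction on a toy graph and then buckets users into exposure conditions. The graph, the assignments, and the 0.5 cutoff for "high exposure" are all hypothetical choices made for illustration, not a recommended specification.

```python
def neighbor_exposure(assignments, neighbors):
    """Fraction of each user's neighbors that are treated (0.0 if no neighbors)."""
    exposure = {}
    for user, friends in neighbors.items():
        if friends:
            exposure[user] = sum(assignments[f] for f in friends) / len(friends)
        else:
            exposure[user] = 0.0
    return exposure


# Hypothetical toy graph: own assignment Z_i and adjacency lists N(i).
assignments = {"a": 1, "b": 0, "c": 1, "d": 0}
neighbors = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}

exposure = neighbor_exposure(assignments, neighbors)

# Assumed exposure mapping: (own treatment, at least half of neighbors treated).
HIGH_EXPOSURE_CUTOFF = 0.5  # arbitrary threshold for illustration
conditions = {
    user: (assignments[user], exposure[user] >= HIGH_EXPOSURE_CUTOFF)
    for user in assignments
}

for user in sorted(conditions):
    print(user, round(exposure[user], 3), conditions[user])
```

Comparing mean outcomes across the four resulting cells is one simple way to talk about direct versus spillover effects, provided each cell has enough units and the exposure definition is fixed before looking at outcomes.
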
Robustness checks make observational causal claims credible. Use placebo outcomes, placebo treatment dates, negative-control geos, alternative control groups, leave-one-market-out sensitivity, covariate balance, pre-period fit, and heterogeneity by segment such as new users, heavy users, advertisers, or creators.
Power and MDE should match the design unit. For clustered tests, approximate detectable effect using cluster-level variance and intraclass correlation, not raw user count. A campaign with 100M impressions but 30 markets may still be underpowered for small brand-lift effects.
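
To make the clustered-power point concrete, here is a minimal sketch using the standard design-effect approximation $1+(m-1)\rho$ for equal cluster size $m$ and intraclass correlation $\rho$. The function names, the outcome standard deviation, the cluster size, and the ICC value are illustrative assumptions, and the normal approximation is optimistic when clusters are few.

```python
from math import sqrt
from statistics import NormalDist


def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal-sized clusters."""
    return 1.0 + (cluster_size - 1) * icc


def mde_two_arm(sigma, clusters_per_arm, cluster_size, icc, alpha=0.05, power=0.8):
    """Approximate MDE for a difference in means with equal allocation."""
    n_per_arm = clusters_per_arm * cluster_size
    n_effective = n_per_arm / design_effect(cluster_size, icc)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * sigma * sqrt(2.0 / n_effective)


# Hypothetical setup: 30 markets split 15/15, one million users per market.
SIGMA = 1.0          # assumed outcome standard deviation
CLUSTER_SIZE = 1_000_000
ICC = 0.01           # assumed intraclass correlation

mde_naive = mde_two_arm(SIGMA, 15, CLUSTER_SIZE, icc=0.0)
mde_clustered = mde_two_arm(SIGMA, 15, CLUSTER_SIZE, icc=ICC)

print(f"design effect:       {design_effect(CLUSTER_SIZE, ICC):.2f}")
print(f"MDE ignoring clusters: {mde_naive:.5f}")
print(f"MDE with clustering:   {mde_clustered:.5f}")
```

Under these made-up inputs the clustered MDE is larger than the naive one by a factor of the square root of the design effect, which is why raw user or impression counts can badly overstate power.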

Tip: In interviews, say the identifying assumption out loud, then explain how you would try to falsify it. That is usually more valuable than naming a fancy method.

Worked example

For the prompt “Prove source growth is cannibalization, not incremental,” start by clarifying what “source” means: a traffic channel, product surface, recommendation module, notification type, or ad placement. Then define the business outcome hierarchy: source-level Sessions or Clicks are not enough; you need aggregate Time Spent, Revenue, Retention, or another north-star metric to determine incrementality. A strong first-30-seconds frame is: “I’d estimate the causal effect of increasing exposure to this source on both its own metric and total platform value, then test whether gains are offset by declines in substitute sources.”

Organize the answer into four pillars: metric decomposition, identification strategy, validation checks, and decision rule. For identification, prefer a randomized holdout if feasible: randomly suppress or reduce the source for eligible users, compare total outcomes, and decompose deltas by surface. If randomization is not feasible, propose DiD across users, geos, or cohorts with differential exposure changes, using pre-trend diagnostics and matched controls. The key design decision is whether the unit of analysis should be user-level or cluster-level; if users interact or content supply is shared, cluster randomization or geo-level analysis may be safer despite lower power. Close by saying you would report both gross source lift and net incremental lift: “If source A rises 10M sessions but total sessions rise only 1M while source B falls 9M, I would call most of the growth cannibalized.” If you had more time, you would check heterogeneous effects by new versus existing users, heavy versus light users, and short-term novelty versus longer-term retention.
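
The gross-versus-net reporting in that closing statement can be sketched as a small calculation. The function name and the "cannibalized share" definition (gross lift minus net lift, divided by gross lift) are assumptions for illustration, and the inputs are assumed to be causal treatment-minus-control deltas from a holdout or DiD, not raw trends.

```python
def cannibalization_summary(source_deltas, focal_source):
    """Summarize gross lift, net lift, and the share of gross lift offset elsewhere.

    source_deltas: causal delta (treatment minus control) for each source.
    focal_source: the source whose growth is being evaluated.
    """
    gross_lift = source_deltas[focal_source]
    net_lift = sum(source_deltas.values())
    if gross_lift > 0:
        cannibalized_share = (gross_lift - net_lift) / gross_lift
    else:
        cannibalized_share = None  # undefined when the focal source did not grow
    return {
        "gross_lift": gross_lift,
        "net_lift": net_lift,
        "cannibalized_share": cannibalized_share,
    }


# Numbers from the example above: source A up 10M sessions, source B down 9M.
deltas = {"A": 10_000_000, "B": -9_000_000}
summary = cannibalization_summary(deltas, "A")
print(summary)
```

With these inputs the net lift is 1M sessions and the cannibalized share is 0.9, matching the "most of the growth cannibalized" reading; in practice each delta would carry its own confidence interval.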

A second angle

For the prompt “Design and analyze A/B test with interference,” the same causal logic applies, but the main threat is not omitted-variable bias; it is spillover between experimental units. A naive user-level A/B test might underestimate or overestimate impact if treated users create content, send messages, invite friends, or affect auction prices seen by control users. You should define direct, indirect, and total effects, then choose cluster randomization, graph-cluster assignment, geo randomization, or an exposure model. The tradeoff is precision versus validity: user-level randomization gives huge sample size but biased estimates under interference, while cluster-level randomization reduces contamination but may have much larger standard errors. The analysis should use cluster-level variance estimation, randomization inference when clusters are few, and diagnostics for cross-cluster exposure.
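
Randomization inference with few clusters can be sketched as an exact permutation test on cluster-level means. This is a minimal version that assumes complete randomization of a fixed number of treated clusters (not matched pairs) and tests the sharp null of no effect for any cluster; the outcome values are made up for illustration.

```python
from itertools import combinations
from statistics import mean


def randomization_test(cluster_outcomes, treated_indices):
    """Exact two-sided randomization p-value for a difference in cluster means."""
    n_clusters = len(cluster_outcomes)
    n_treated = len(treated_indices)

    def diff_in_means(treated_set):
        treated = [cluster_outcomes[i] for i in treated_set]
        control = [cluster_outcomes[i] for i in range(n_clusters) if i not in treated_set]
        return mean(treated) - mean(control)

    observed = diff_in_means(set(treated_indices))
    as_extreme = 0
    total = 0
    # Enumerate every assignment the randomization could have produced.
    for combo in combinations(range(n_clusters), n_treated):
        total += 1
        if abs(diff_in_means(set(combo))) >= abs(observed) - 1e-12:
            as_extreme += 1
    return observed, as_extreme / total


# Hypothetical per-cluster mean outcomes for 8 geos; the last 4 were treated.
cluster_outcomes = [5.1, 4.9, 5.0, 5.2, 6.0, 6.2, 5.9, 6.1]
treated_indices = [4, 5, 6, 7]

observed_diff, p_value = randomization_test(cluster_outcomes, treated_indices)
print(f"observed difference: {observed_diff:.3f}")
print(f"randomization p-value: {p_value:.4f}")
```

Because the unit of analysis is the cluster, the p-value reflects only the 70 possible assignments of 4 treated geos out of 8, no matter how many users sit inside each geo. For larger designs, sampling random assignments instead of enumerating all of them is the usual workaround.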

Common pitfalls

Pitfall: Treating correlation as causation because one source grew while another declined.

A weak answer says, “Source A grew and total stayed flat, so it cannibalized.” A better answer explains the counterfactual: what would total and other sources have done absent the change? Seasonality, concurrent launches, ranking changes, or user mix shifts could produce the same pattern without cannibalization.

Pitfall: Reciting DiD without defending parallel trends.

Interviewers often push on why the control group is valid. Do not just write the DiD formula; explain how you would select controls, inspect pre-period trends, run placebo dates, and handle staggered adoption or heterogeneous effects.
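
A placebo-date check can be sketched with the same 2x2 DiD arithmetic: estimate the effect at the real launch, then rerun the estimate on pre-launch data only with a fake launch date. The series, the launch index, and the function name below are hypothetical and only illustrate the mechanics.

```python
from statistics import mean


def did_from_series(treated, control, first_post_index):
    """2x2 DiD on period-level means, treating first_post_index as the launch."""
    treated_change = mean(treated[first_post_index:]) - mean(treated[:first_post_index])
    control_change = mean(control[first_post_index:]) - mean(control[:first_post_index])
    return treated_change - control_change


# Hypothetical weekly metric per group; the real launch is at index 4.
treated = [10.0, 11.0, 12.0, 13.0, 17.0, 18.0]
control = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
LAUNCH_INDEX = 4

# Gap between groups in each pre-period; a stable gap supports parallel trends.
pre_gaps = [t - c for t, c in zip(treated[:LAUNCH_INDEX], control[:LAUNCH_INDEX])]

actual_effect = did_from_series(treated, control, LAUNCH_INDEX)

# Placebo: drop post-launch data and pretend the launch happened at index 2.
placebo_effect = did_from_series(treated[:LAUNCH_INDEX], control[:LAUNCH_INDEX], 2)

print("pre-period gaps:", pre_gaps)
print("actual DiD estimate:", actual_effect)
print("placebo DiD estimate:", placebo_effect)
```

In this toy data the pre-period gap is constant and the placebo estimate is zero, so the design passes the check; a placebo estimate comparable in size to the actual one would suggest the control group was already diverging before launch.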

Pitfall: Communicating only methods, not decisions.

A common depth mistake is listing “DiD, synthetic control, IV, regression discontinuity” without tying them to launch/no-launch guidance. Land the business interpretation: incremental lift, confidence interval, substitution pattern, segment heterogeneity, and what decision you would recommend under uncertainty.

Connections

Expect pivots into experiment design, especially cluster randomization, power, `MDE`, and guardrail metrics. If causal validity becomes the focus, the interviewer may ask about instrumental variables, regression discontinuity, synthetic control, or selection bias. For Meta-specific product contexts, be ready to connect this to ranking changes, ads measurement, notification experiments, marketplace liquidity, and recommender-system metric tradeoffs.

