Causal Inference And Difference-In-Differences

What's being tested

Meta Data Scientists are expected to separate causal impact from correlation in messy product ecosystems where user behavior, network effects, ranking systems, and ads markets interact. Interviewers probe whether you can choose an identification strategy, define treatment and control units, specify metrics, diagnose bias, and explain assumptions clearly enough to support launch or investment decisions. The recurring skill is not “run a regression”; it is designing credible evidence when simple A/B testing may be contaminated, underpowered, or difficult for ethical or product reasons. Strong answers combine experiment design, difference-in-differences, geo lift, logging/instrumentation reasoning, and metric tradeoff judgment.

Core knowledge

Randomized experiments are the gold standard because treatment assignment is independent of potential outcomes: $T_i \perp (Y_i(1), Y_i(0))$. For Meta product tests, think carefully about the randomization unit: user, account, session, device, household, advertiser, page, creator, or geo.
Difference-in-differences estimates causal lift when treatment and control groups are not randomly assigned but have comparable pre-period trends. The basic estimator is:
$\hat{\tau}_{DID} = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})$
The core assumption is parallel trends, not equal levels.
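As a minimal sketch of the estimator above, the following computes the 2x2 difference on group means. The function name `did_estimate` and the session counts are hypothetical, chosen only for illustration. Note that the two groups sit at very different levels (about 10 vs. 20), which is acceptable as long as their trends would have matched absent treatment.

```python
from statistics import mean


def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """2x2 difference-in-differences on group means."""
    treat_change = mean(treat_post) - mean(treat_pre)
    ctrl_change = mean(ctrl_post) - mean(ctrl_pre)
    return treat_change - ctrl_change


# Hypothetical sessions per user (illustrative numbers, not real data).
treat_pre = [10.0, 11.0, 9.0]
treat_post = [13.0, 14.0, 12.0]
ctrl_pre = [20.0, 21.0, 19.0]
ctrl_post = [21.0, 22.0, 20.0]

tau_hat = did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post)
print(tau_hat)  # 2.0: treatment rose by 3, control by 1
```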
Parallel trends should be defended empirically and conceptually. Plot pre-period metric trajectories, estimate placebo effects in pre-periods, compare seasonality, and explain why no simultaneous shock affects treatment and control differently. If pre-trends diverge, consider matched controls, synthetic control, or covariate adjustment; if none of these restores comparability, abandon the causal claim.
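One way to make the placebo check concrete is to run the same DID on pre-period data only, pretending treatment started at an earlier fake cutoff. The sketch below assumes equally spaced pre-period observations; `placebo_did` and all values are hypothetical. A placebo estimate far from zero suggests the groups were already diverging before launch.

```python
from statistics import mean


def placebo_did(treat_series, ctrl_series, fake_cutoff):
    """DID on pre-period data, pretending treatment began at fake_cutoff."""
    treat_change = (mean(treat_series[fake_cutoff:])
                    - mean(treat_series[:fake_cutoff]))
    ctrl_change = (mean(ctrl_series[fake_cutoff:])
                   - mean(ctrl_series[:fake_cutoff]))
    return treat_change - ctrl_change


# Hypothetical weekly pre-period metrics (no real treatment in any week).
ctrl_pre = [20.0, 20.5, 21.0, 21.5]
treat_parallel = [10.0, 10.5, 11.0, 11.5]   # same slope as control
treat_diverging = [10.0, 11.0, 12.5, 14.5]  # already accelerating

placebo_ok = placebo_did(treat_parallel, ctrl_pre, fake_cutoff=2)
placebo_bad = placebo_did(treat_diverging, ctrl_pre, fake_cutoff=2)
print(placebo_ok, placebo_bad)  # 0.0 2.0
```

In practice the placebo estimate would be compared against its uncertainty, not against exactly zero.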
Geo experiments are useful when user-level randomization causes spillovers, such as marketplace liquidity, ad auctions, creator ecosystems, or cross-app cannibalization. Randomize matched geos or clusters, measure market-level outcomes like DAU, sessions, ad_impressions, purchases, brand_lift, or revenue, and analyze with cluster-robust uncertainty.
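A rough sketch of matched-pair geo randomization: rank geos by a pre-period metric, pair neighbors in the ranking, and flip a coin within each pair. The geo names, the DAU values, and the single-metric matching rule are all assumptions for illustration; real matching would typically use several pre-period covariates.

```python
import random


def matched_pair_assignment(pre_metric_by_geo, seed=0):
    """Pair geos with adjacent pre-period values, randomize within pairs."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    ranked = sorted(pre_metric_by_geo, key=pre_metric_by_geo.get)
    assignment = {}
    # With an odd number of geos, the last one is left unassigned.
    for geo_a, geo_b in zip(ranked[0::2], ranked[1::2]):
        treated = rng.choice([geo_a, geo_b])
        for geo in (geo_a, geo_b):
            assignment[geo] = "treatment" if geo == treated else "control"
    return assignment


# Hypothetical pre-period DAU by geo (illustrative values).
pre_dau = {"geo_a": 120, "geo_b": 450, "geo_c": 130,
           "geo_d": 470, "geo_e": 900, "geo_f": 880}
assignment = matched_pair_assignment(pre_dau, seed=42)
for geo, arm in sorted(assignment.items()):
    print(geo, arm)
```

Pairing before randomizing keeps the arms balanced on the matching metric, which matters when only a handful of geos are available.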
Cannibalization means growth in one surface or source reduces usage elsewhere rather than creating incremental ecosystem value. Test it by defining a total ecosystem metric, e.g. FB_time_spent + IG_time_spent, not only the growing source’s metric. A source can show positive lift while total engagement is flat or negative.
Treatment contamination occurs when control users are exposed indirectly. Examples: friends share treated content, advertisers shift budgets across geos, multi-account users experience a ranking change on one account and adapt behavior on another. Choose coarser randomization, measure exposure, or estimate intent-to-treat rather than exposed-only effects.
Intent-to-treat estimates the effect of assignment, while treatment-on-treated estimates the effect among actually exposed units. ITT is usually safer because exposure is post-treatment and can be endogenous. If reporting TOT, use assignment as an instrument only if exclusion and monotonicity assumptions are plausible.
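The relationship between the two estimands can be sketched with the Wald estimator, which divides the ITT by the difference in exposure rates between arms. This is only valid under the exclusion and monotonicity assumptions named above; the function name and data are hypothetical.

```python
from statistics import mean


def itt_and_tot(y_assigned, y_control, exposed_assigned, exposed_control):
    """Return (ITT, Wald TOT estimate) using assignment as the instrument."""
    itt = mean(y_assigned) - mean(y_control)
    # First stage: how much assignment moved actual exposure.
    first_stage = mean(exposed_assigned) - mean(exposed_control)
    if first_stage == 0:
        raise ValueError("assignment did not change exposure")
    return itt, itt / first_stage


# Hypothetical outcomes and 0/1 exposure flags per unit.
y_assigned = [5.0, 7.0, 6.0, 6.0]
y_control = [5.0, 5.0, 6.0, 4.0]
exposed_assigned = [1, 1, 0, 0]  # only half of assigned units were exposed
exposed_control = [0, 0, 0, 0]   # no contamination in this toy example

itt, tot = itt_and_tot(y_assigned, y_control,
                       exposed_assigned, exposed_control)
print(itt, tot)  # 1.0 2.0
```

Comparing exposed units directly against control would instead condition on a post-treatment variable, which is the endogeneity problem the ITT avoids.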
Metric hierarchy should include a primary decision metric, guardrails, and diagnostics. For ranking or feed changes, primary metrics might be meaningful_interactions, retention, or long-term_sessions; guardrails include hide_rate, report_rate, unfollow_rate, latency, and ecosystem metrics across Facebook and Instagram.
Power and MDE matter more in geo tests because effective sample size is the number of independent clusters, not users. A rough two-arm MDE is:
$MDE \approx (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sqrt{\frac{2\sigma^2}{n}}$
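Here $\sigma^2$ is the outcome variance and $n$ is the number of independent units per arm. For geo tests, $n$ counts geos per arm (equivalently, matched pairs), not users, so variance reduction from pre-period covariates can be crucial.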
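To see why cluster count dominates, the formula can be evaluated directly. The sketch below uses the standard-library normal quantile function; the `r_squared` argument is an added assumption that models covariate adjustment as shrinking the residual variance to $\sigma^2(1 - R^2)$, and all inputs are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def mde(sigma, n_per_arm, alpha=0.05, power=0.8, r_squared=0.0):
    """Rough two-arm MDE; r_squared is variance explained by covariates."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    residual_var = sigma ** 2 * (1 - r_squared)
    return (z_alpha + z_beta) * sqrt(2 * residual_var / n_per_arm)


# Same outcome standard deviation, very different numbers of units.
mde_geo = mde(sigma=1.0, n_per_arm=20)           # 20 geos per arm
mde_user = mde(sigma=1.0, n_per_arm=2_000_000)   # user-level test
mde_geo_adj = mde(sigma=1.0, n_per_arm=20, r_squared=0.75)

print(round(mde_geo, 3), round(mde_user, 4), round(mde_geo_adj, 3))
```

With these inputs the geo test can only detect effects several hundred times larger than the user-level test, and a covariate that explains 75% of the variance halves the geo MDE.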
Regression adjustment improves precision and handles covariates but does not fix bad identification. A common DID specification is:
$Y_{it} = \alpha_i + \delta_t + \beta(Treat_i \times Post_t) + \epsilon_{it}$
where $\alpha_i$ and $\delta_t$ are unit and time fixed effects and $\beta$ is the DID estimate. Use clustered standard errors at the assignment level.
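A dependency-free sketch of this specification for a balanced panel with a single common treatment date: double-demeaning removes the unit and time fixed effects, and the standard error sums score contributions within each unit before squaring. The standard error here is a basic sandwich form with no small-sample correction, and the panel values are hypothetical. With so few clusters it would be unreliable in practice, and a regression library would be the usual tool.

```python
from math import sqrt
from statistics import mean


def twfe_did(y, d):
    """Within estimator for Y_it = a_i + d_t + beta * D_it + e_it.

    y and d are lists indexed [unit][period] for a balanced panel,
    where d holds the Treat_i * Post_t indicator.
    Returns (beta, unit-clustered standard error).
    """
    n_units, n_periods = len(y), len(y[0])
    cells = [(i, t) for i in range(n_units) for t in range(n_periods)]

    def double_demean(x):
        # Remove unit and time means; valid for balanced panels only.
        unit_mean = [mean(row) for row in x]
        time_mean = [mean(x[i][t] for i in range(n_units))
                     for t in range(n_periods)]
        grand = mean(unit_mean)
        return [[x[i][t] - unit_mean[i] - time_mean[t] + grand
                 for t in range(n_periods)] for i in range(n_units)]

    y_dm, d_dm = double_demean(y), double_demean(d)
    sxx = sum(d_dm[i][t] ** 2 for i, t in cells)
    beta = sum(d_dm[i][t] * y_dm[i][t] for i, t in cells) / sxx

    # Cluster by unit: sum score contributions within each unit first.
    scores = [sum(d_dm[i][t] * (y_dm[i][t] - beta * d_dm[i][t])
                  for t in range(n_periods)) for i in range(n_units)]
    se = sqrt(sum(s ** 2 for s in scores)) / sxx
    return beta, se


# Hypothetical panel: units 0 and 1 are treated from period 2 onward.
unit_effect = [10.0, 12.0, 20.0, 22.0]
time_effect = [0.0, 1.0, 2.0, 3.0]
noise = [[0.1, -0.1, 0.2, 0.0],
         [-0.2, 0.1, 0.0, 0.1],
         [0.0, 0.1, -0.1, 0.1],
         [0.1, -0.2, 0.1, -0.1]]
d = [[1.0 if i < 2 and t >= 2 else 0.0 for t in range(4)] for i in range(4)]
y = [[unit_effect[i] + time_effect[t] + 2.0 * d[i][t] + noise[i][t]
      for t in range(4)] for i in range(4)]

beta_hat, se_hat = twfe_did(y, d)
print(round(beta_hat, 3), round(se_hat, 3))
```

In this balanced, single-adoption-date setting the coefficient coincides with the 2x2 difference of group-period means, which is a useful sanity check on any regression output.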
Logging and instrumentation are analysis inputs, not an engineering design task for DS. You should specify what must be measurable: assignment, exposure, ranking position, eligible population, impressions, clicks, conversions, account switches, and app/session context. Avoid defining treatment based only on observed engagement, which creates selection bias.
Heterogeneous treatment effects are often central at Meta scale. Segment by new vs tenured users, heavy vs light users, creator type, advertiser size, region, platform, and baseline propensity. Pre-register key cuts or use false discovery controls such as Benjamini-Hochberg to avoid cherry-picking wins.
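The Benjamini-Hochberg step-up procedure mentioned above can be sketched in a few lines. The p-values below are hypothetical segment-level results; the point is that a naive p < 0.05 rule would flag four segments while the false discovery control keeps only two.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * q / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for six pre-specified segment cuts.
p_values = [0.04, 0.001, 0.60, 0.012, 0.03, 0.20]
rejected = benjamini_hochberg(p_values, q=0.05)
print(rejected)  # [False, True, False, True, False, False]
```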

Worked example

For “Prove source growth is cannibalization, not incremental”, a strong candidate would first clarify what the “source” is, what ecosystem outcome matters, and whether there was a launch, ranking change, marketing push, or organic trend. In the first 30 seconds, they should say: “I need to distinguish source-level growth from total incremental value, so I’ll define both a source metric like source_sessions and a total metric like total_sessions or total_time_spent across relevant surfaces.” The answer can be organized into four pillars: metric definition, identification strategy, validity checks, and interpretation.

The candidate might propose a randomized holdout if possible: suppress or delay the source for a random eligible population and compare total ecosystem activity. If randomization is not possible, they can use difference-in-differences around the source expansion, with exposed markets or users as treatment and matched unexposed markets or users as control. They should explicitly test pre-trends for both source and total metrics, because cannibalization requires showing the source rose while total activity did not rise proportionally. A key tradeoff is treatment unit: user-level tests offer power but may miss network or cross-surface spillovers, while geo-level tests reduce contamination but require more time and careful matching. The close should be cautious: “If source usage rises by 10 minutes but total ecosystem time rises by only 1 minute, I would estimate roughly 9 minutes as displaced activity, subject to parallel-trend and spillover assumptions.” If they had more time, they could add heterogeneous effects and longer-run retention to see whether apparent short-term cannibalization becomes incremental habit formation.
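The arithmetic in that closing statement can be written down explicitly. In this sketch both lifts are assumed to be causal estimates already (for example, from a holdout or DID), and the function name is hypothetical.

```python
def displacement(source_lift, total_lift):
    """Split a source-level lift into displaced and incremental parts."""
    displaced = source_lift - total_lift
    share_displaced = displaced / source_lift
    return displaced, share_displaced


# Numbers from the example: +10 minutes on the source, +1 minute in total.
displaced, share = displacement(source_lift=10.0, total_lift=1.0)
print(displaced, share)  # 9.0 0.9
```

A share near 1 points to cannibalization and a share near 0 to incremental value, with both lifts carrying their own uncertainty.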

A second angle

For “Evaluate brand ads effectiveness on social media causally”, the causal structure is similar but the outcome and randomization constraints differ. Instead of measuring ecosystem usage, the primary outcome might be ad_recall, brand_awareness, favorability, search_lift, or downstream conversion_rate, and many outcomes are noisy or delayed. User-level randomization can work for ad exposure holdouts, but geo or matched-market lift is often preferred when advertisers reallocate budgets or campaigns have broad spillovers. Difference-in-differences can compare treated markets receiving the campaign against matched controls, but the candidate must address concurrent marketing, seasonality, auction dynamics, and survey nonresponse. The same discipline applies: define the estimand, defend the control, show pre-trends, and separate incremental lift from reallocation.

Common pitfalls

Pitfall: Treating correlation as causation because a metric rose after launch.

A weak answer says, “The new source grew 20%, so it worked,” or “brand awareness increased after the campaign, so ads caused it.” A stronger answer asks what would have happened otherwise, defines a counterfactual, and proposes randomization, DID, matched markets, or a placebo test to rule out seasonality and selection.

Pitfall: Optimizing a local metric while ignoring ecosystem tradeoffs.

For Meta, many features shift attention across surfaces, accounts, creators, or apps. If you only measure clicks, source_sessions, or ad_revenue on the treated surface, you may miss lower retention, worse hide_rate, reduced Instagram engagement, or advertiser budget cannibalization. Always include total value and guardrail metrics.

Pitfall: Overusing statistical jargon without stating assumptions.

Saying “I’d run a DID with fixed effects” is not enough. Interviewers want to hear why the treatment and control are comparable, what the parallel-trends evidence would look like, how interference might break the design, and what decision you would make under ambiguous results.

Connections

Interviewers may pivot from here into A/B testing power analysis, metric design, ranking evaluation, synthetic control, instrumental variables, or network interference. They may also ask how you would diagnose a surprising experiment result, reconcile online and offline metrics, or decide whether to launch when primary and guardrail metrics disagree.

