Difference-In-Differences And Quasi-Experiments

What's being tested

Interviewers are probing whether you can estimate causal impact when a clean user-level randomized experiment is unavailable, underpowered, contaminated, or ethically impractical. The core skill is choosing and defending a credible identification strategy: defining treatment/control, articulating assumptions, checking pre-trends, estimating lift, and stress-testing whether an observed metric change is incremental or merely shifted from another channel. Meta cares because many high-stakes decisions (ads measurement, commerce surfaces, ranking changes, chatbot launches, marketplace growth) affect networks, merchants, advertisers, and users in ways that can violate simple A/B test assumptions. A strong Data Scientist should translate messy product logs into a causal estimate with clear caveats, not just report before/after deltas.

Core knowledge

Difference-in-differences estimates treatment impact by comparing treated units’ pre/post change to control units’ pre/post change:
$\hat{\tau}_{DiD}=(\bar{Y}_{T,post}-\bar{Y}_{T,pre})-(\bar{Y}_{C,post}-\bar{Y}_{C,pre})$
Use it when treatment exposure is not randomized but there is a plausible untreated comparison group.
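As a minimal sketch of the formula above, the estimate is just two differences of group means. The function name and the GMV numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """2x2 difference-in-differences from lists of unit-level outcomes."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical weekly GMV per geo (illustrative numbers only).
tau_hat = did_estimate(
    treated_pre=[100, 110, 90],
    treated_post=[130, 140, 120],
    control_pre=[80, 90, 100],
    control_post=[90, 100, 110],
)
print(tau_hat)  # treated change of 30 minus control change of 10
```

The raw treated change of 30 would overstate the effect; subtracting the control change of 10 removes the shared trend.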
Parallel trends is the key assumption: absent treatment, treated and control units would have changed similarly. You do not prove it; you build credibility through pre-period plots, placebo tests, event studies, matched controls, and domain reasoning about seasonality, product launches, and marketing shocks.
Two-way fixed effects regression is the common implementation:
$Y_{it}=\alpha_i+\gamma_t+\beta Treatment_{it}+\epsilon_{it}$
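where $Treatment_{it}$ equals 1 when unit $i$ is treated in period $t$, $\alpha_i$ absorbs time-invariant unit differences, and $\gamma_t$ absorbs common time shocks. With two groups and two periods, $\hat{\beta}$ equals $\hat{\tau}_{DiD}$. Cluster standard errors at the treatment assignment level, such as geo, advertiser, retailer, or user cohort.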
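The regression above can be sketched with NumPy alone, assuming a balanced panel stored as a units-by-periods array. On a balanced panel, double-demeaning the outcome and the treatment indicator removes both sets of fixed effects, so the slope of the demeaned regression is the two-way fixed effects coefficient. The function name, the simulated data, and the choice of an uncorrected (CR0) cluster-robust variance are assumptions made for this illustration; a production analysis would more likely use a regression library.

```python
import numpy as np

def twfe_did(Y, D):
    """TWFE estimate of beta on a balanced panel (rows = units, cols = periods).

    Standard errors are clustered by unit (CR0, no small-sample correction).
    """
    def double_demean(M):
        return (M - M.mean(axis=1, keepdims=True)
                  - M.mean(axis=0, keepdims=True) + M.mean())

    y, d = double_demean(Y), double_demean(D)
    beta = (d * y).sum() / (d ** 2).sum()
    resid = y - beta * d
    cluster_scores = (d * resid).sum(axis=1)   # one score per unit (cluster)
    se = np.sqrt((cluster_scores ** 2).sum()) / (d ** 2).sum()
    return beta, se

# Simulated panel: 40 units, 10 periods, first 20 units treated from period 5.
rng = np.random.default_rng(0)
n_units, n_periods, true_effect = 40, 10, 2.0
treated = (np.arange(n_units) < 20).astype(float)
post = (np.arange(n_periods) >= 5).astype(float)
D = np.outer(treated, post)
alpha = rng.normal(0, 3, size=(n_units, 1))      # unit fixed effects
gamma = rng.normal(0, 1, size=(1, n_periods))    # common time shocks
Y = alpha + gamma + true_effect * D + rng.normal(0, 0.5, size=(n_units, n_periods))

beta_hat, se_hat = twfe_did(Y, D)
print(round(beta_hat, 3), round(se_hat, 3))
```

Clustering by unit here stands in for clustering at the assignment level; if treatment were assigned by geo and rows were users, the scores would be summed within geo instead.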
Event studies generalize DiD by estimating effects at each relative time period:
$Y_{it}=\alpha_i+\gamma_t+\sum_{k \neq -1}\beta_k \mathbf{1}[t-T_i=k]+\epsilon_{it}$
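Here $T_i$ is the period in which unit $i$ becomes treated, and $k=-1$ is the omitted reference period, so each $\beta_k$ is measured relative to the period just before treatment. Pre-treatment coefficients should be near zero. Post-treatment coefficients show ramp-up, novelty effects, decay, or delayed conversion windows.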
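A minimal sketch of the event-study regression, assuming a balanced panel and a single common launch period (staggered timing would need relative-period indicators per cohort). It builds unit, period, and treated-by-relative-period dummies explicitly and omits $k=-1$ as the reference. The function name and the simulated ramp-up are hypothetical.

```python
import numpy as np

def event_study(Y, treated, treat_period):
    """Event-study coefficients {k: beta_k} with k = -1 as the reference."""
    n_units, n_periods = Y.shape
    # Row r of the stacked data is unit r // n_periods in period r % n_periods.
    unit_idx, time_idx = np.divmod(np.arange(n_units * n_periods), n_periods)
    rel_periods = [k for k in range(-treat_period, n_periods - treat_period)
                   if k != -1]
    is_treated_row = treated[unit_idx] == 1

    cols = [(unit_idx == i).astype(float) for i in range(n_units)]        # alpha_i
    cols += [(time_idx == t).astype(float) for t in range(1, n_periods)]  # gamma_t
    cols += [(is_treated_row & (time_idx - treat_period == k)).astype(float)
             for k in rel_periods]                                        # beta_k
    X = np.column_stack(cols)
    coefs, *_ = np.linalg.lstsq(X, Y.ravel(), rcond=None)
    return dict(zip(rel_periods, coefs[-len(rel_periods):]))

# Simulated data: no pre-trend, effect ramps 1.0, 1.5, 2.0, 2.5 after launch.
rng = np.random.default_rng(1)
n_units, n_periods, treat_period = 60, 8, 4
treated = (np.arange(n_units) < 30).astype(int)
rel = np.arange(n_periods) - treat_period
true_effect = np.where(rel >= 0, 1.0 + 0.5 * rel, 0.0)
Y = (rng.normal(0, 2, (n_units, 1)) + rng.normal(0, 1, (1, n_periods))
     + treated[:, None] * true_effect[None, :]
     + rng.normal(0, 0.2, (n_units, n_periods)))

betas = event_study(Y, treated, treat_period)
for k in sorted(betas):
    print(k, round(betas[k], 2))
```

In this simulation the coefficients for $k<-1$ should hover near zero and the post-period coefficients should trace the ramp; in real data, pre-period coefficients that drift away from zero are the warning sign.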
Unit of analysis must match the causal question. For ads lift, units might be DMA, country, advertiser, or user; for a commerce tab, units might be user-day or geo-day; for retailer chatbot value, units might be retailer-week. If treatment was assigned at geo level, either aggregate to geo or cluster standard errors by geo; treating user-level rows as independent overstates precision.
Synthetic control creates a weighted combination of untreated units to match the treated unit’s pre-period trajectory. It is useful with few treated geos or markets, especially for launches where randomization is not possible. It is weaker when there are too few donor units, unstable pre-periods, or spillovers across markets.
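One way to sketch the weighting step is projected gradient descent over the simplex (nonnegative weights that sum to one), minimizing pre-period squared error. This is an illustrative solver, not a reference implementation: it assumes outcomes have been standardized so the problem is well conditioned, and the function names and simulated donors are hypothetical. In practice a constrained optimizer from an established library is the more usual choice.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = ks[u - css / ks > 0][-1]
    return np.maximum(v - css[rho - 1] / rho, 0.0)

def synthetic_control_weights(donors_pre, treated_pre, n_iter=5000):
    """Donor weights minimizing pre-period squared error.

    donors_pre: (n_pre_periods, n_donors); treated_pre: (n_pre_periods,)
    """
    n_donors = donors_pre.shape[1]
    w = np.full(n_donors, 1.0 / n_donors)
    step = 1.0 / np.linalg.norm(donors_pre, ord=2) ** 2  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = donors_pre.T @ (donors_pre @ w - treated_pre)
        w = project_to_simplex(w - step * grad)
    return w

# Simulated standardized outcomes: the treated market is a 60/40 blend of
# donors 0 and 1 before launch, then gains 1.5 after launch.
rng = np.random.default_rng(2)
donors_pre = rng.normal(size=(12, 4))
donors_post = rng.normal(size=(4, 4))
true_w = np.array([0.6, 0.4, 0.0, 0.0])
treated_pre = donors_pre @ true_w
treated_post = donors_post @ true_w + 1.5

w = synthetic_control_weights(donors_pre, treated_pre)
effect = treated_post - donors_post @ w   # per-period gap vs. synthetic control
print(np.round(w, 3), np.round(effect, 3))
```

The post-period gap between the treated unit and its weighted donors is the effect estimate; a poor pre-period fit is a reason to distrust that gap.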
Matched-market tests pair similar geos or merchants using pre-period metrics like GMV, purchase_rate, impressions, ad_spend, category mix, and seasonality. Matching improves comparability but does not fix unobserved time-varying confounding; always show balance and pre-trend diagnostics.
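A minimal sketch of pairing, assuming each market is summarized by a few pre-period features and that there are at least as many candidate controls as treated markets. It standardizes features and matches greedily without replacement; the function name and the feature values are hypothetical, and greedy matching is only one of several reasonable choices.

```python
import numpy as np

def match_markets(treated_features, control_features):
    """Greedy 1:1 nearest-neighbor matching on standardized features.

    Returns a list of (treated_index, control_index) pairs.
    """
    pooled = np.vstack([treated_features, control_features])
    mu, sd = pooled.mean(axis=0), pooled.std(axis=0)
    t = (treated_features - mu) / sd
    c = (control_features - mu) / sd
    available = list(range(len(c)))
    pairs = []
    for i in range(len(t)):
        dists = [np.linalg.norm(t[i] - c[j]) for j in available]
        best = available[int(np.argmin(dists))]
        pairs.append((i, best))
        available.remove(best)   # each control is used at most once
    return pairs

# Hypothetical pre-period features per geo: [weekly GMV, purchase_rate].
treated_features = np.array([[100.0, 0.050], [500.0, 0.020]])
control_features = np.array([[480.0, 0.021], [105.0, 0.048], [300.0, 0.030]])
pairs = match_markets(treated_features, control_features)
print(pairs)
```

Standardizing first keeps a large-scale metric such as GMV from dominating the distance. After pairing, the balance and pre-trend checks still need to be run on the matched set.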
Cannibalization analysis separates incremental growth from shifted behavior. If a new source’s GMV rises, check whether total GMV rises or whether other surfaces’ GMV, sessions, or conversions decline. A credible DiD uses total outcome as primary and channel mix as diagnostic, not the new source’s growth alone.
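Because DiD is linear in the outcome, the lift on total GMV equals the lift on the new source plus the lift on all other sources. A small sketch with hypothetical cell means makes the decomposition concrete; the dictionary layout and the numbers are illustrative only.

```python
def did(cell_means):
    """2x2 DiD from a dict keyed by (group, period) -> mean outcome."""
    return ((cell_means[("treated", "post")] - cell_means[("treated", "pre")])
            - (cell_means[("control", "post")] - cell_means[("control", "pre")]))

# Hypothetical mean weekly GMV per geo, split by surface.
new_source = {("treated", "pre"): 0.0, ("treated", "post"): 50.0,
              ("control", "pre"): 0.0, ("control", "post"): 0.0}
other_sources = {("treated", "pre"): 1000.0, ("treated", "post"): 990.0,
                 ("control", "pre"): 900.0, ("control", "post"): 930.0}
total = {key: new_source[key] + other_sources[key] for key in new_source}

lift_new = did(new_source)        # growth on the new surface
lift_other = did(other_sources)   # change on existing surfaces
lift_total = did(total)           # incremental effect on the business
cannibalized_share = -lift_other / lift_new

print(lift_new, lift_other, lift_total, cannibalized_share)
```

In this made-up case the new source shows a lift of 50, but other sources fall by 40 relative to control, so only 10 is incremental and 80% of the apparent growth is shifted behavior.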
Spillovers and interference are common at Meta scale. Users influence friends, advertisers reallocate budgets, merchants change inventory, and ranking changes affect competing accounts. If treatment affects controls, the control group no longer represents the untreated counterfactual: positive spillovers into controls bias the DiD estimate toward zero, while negative spillovers (for example, budget shifted away from control units) inflate it. Consider geo-level assignment, network-exposure definitions, or excluding high-contamination segments.
Metric design should include primary, guardrail, and mechanism metrics. For commerce or ads, primary might be incremental revenue, GMV, purchase_rate, or brand_lift; guardrails include hide_rate, report_rate, retention, advertiser ROI, merchant churn, and user experience; mechanism metrics include CTR, conversion_rate, AOV, and exposure.
Power and uncertainty matter more than point estimates. For geo experiments and DiD, effective sample size is often closer to the number of independent clusters than to the raw user count. Use historical variance, intra-cluster correlation, and variance reduction with pre-period covariates such as CUPED where appropriate: $Y^{adj}=Y-\theta(X-\bar{X})$, where $X$ is a pre-period covariate and $\theta=\mathrm{Cov}(X,Y)/\mathrm{Var}(X)$.
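A short sketch of both ideas, under stated assumptions. The CUPED helper takes $\theta$ as the sample covariance of the pre-period covariate and the outcome divided by the covariate's variance, which is the usual choice. The effective-sample-size helper uses the standard design-effect approximation $1+(m-1)\rho$ for equal-sized clusters of size $m$ with intra-cluster correlation $\rho$. Function names and simulated data are hypothetical.

```python
import numpy as np

def cuped_adjust(y, x):
    """Return (adjusted outcome, theta) with theta = Cov(x, y) / Var(x)."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean()), theta

def effective_sample_size(n_total, cluster_size, icc):
    """Approximate effective n under the design effect 1 + (m - 1) * icc."""
    return n_total / (1.0 + (cluster_size - 1) * icc)

# Simulated users: pre-period spend x strongly predicts post-period spend y.
rng = np.random.default_rng(3)
x = rng.normal(100, 20, size=2000)
y = 0.8 * x + rng.normal(0, 5, size=2000)

y_adj, theta = cuped_adjust(y, x)
print(round(theta, 3), round(y.var(ddof=1), 1), round(y_adj.var(ddof=1), 1))

# 50 geos of 10,000 users each: even a small ICC shrinks the effective n sharply.
print(round(effective_sample_size(500_000, 10_000, 0.05), 1))
```

The adjustment leaves the mean unchanged and removes the variance explained by the covariate, which is why it tightens confidence intervals without biasing the estimate when $X$ is measured before treatment.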
Robustness checks should be planned before recommending launch. Use alternative control groups, different pre/post windows, placebo treatment dates, leave-one-geo-out sensitivity, winsorization for outliers, segment cuts, and multiple-testing discipline such as Benjamini-Hochberg or Bonferroni correction when many outcomes are inspected.
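For the multiple-testing step, a minimal sketch of the Benjamini-Hochberg procedure next to a Bonferroni cutoff. The p-values are hypothetical and stand in for several inspected outcomes.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank          # largest rank passing its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Hypothetical p-values for five inspected metrics.
p_values = [0.20, 0.001, 0.035, 0.012, 0.025]
print(benjamini_hochberg(p_values))
print(bonferroni(p_values))
```

Benjamini-Hochberg controls the false discovery rate and rejects four of the five here, while Bonferroni controls the family-wise error rate and rejects only the smallest. Which is appropriate depends on how costly a single false positive is for the launch decision.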

Worked example

For “Prove source growth is cannibalization, not incremental”, a strong candidate should first clarify what “source” means, what total business outcome matters, whether the source was launched everywhere or in a staggered way, and whether there are known concurrent changes such as ranking, marketing, or notification pushes. They would declare that source-level growth alone is not evidence of incrementality; the estimand is the effect on total GMV, revenue, purchases, or sessions, not just the treated channel.

The answer skeleton should have four pillars: define the causal estimand and metrics, construct treatment/control units, estimate a DiD or matched-market model, and validate assumptions with diagnostics. For example, if the source launched in some geos first, use geo-week data with treated geos versus matched untreated geos and estimate $Y_{gt}=\alpha_g+\gamma_t+\beta Launch_{gt}+\epsilon_{gt}$ for total GMV, while separately modeling other-source GMV as a mechanism metric.

A strong candidate would explicitly flag that the control group must be exposed to the same macro seasonality and demand shocks but not the launch itself. They would show pre-trend plots for total GMV and major component sources, run placebo launch dates, and check whether the “new” source’s gain is offset by declines in organic, search, feed, or existing shopping surfaces. One key tradeoff is whether to use a broad total-business metric, which best captures incrementality but may be noisy, versus a narrower funnel metric, which is more sensitive but easier to misinterpret.
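A placebo launch date can be sketched by re-running the same DiD on pre-period data only, with a fake cutoff. The helper and the simulated geo-week panel below are hypothetical; the real launch is placed at week 8 with a true effect of 5.

```python
import numpy as np

def did_from_panel(Y, treated, start, cutoff, end):
    """DiD over periods [start, end) with a real or placebo launch at `cutoff`."""
    pre, post = Y[:, start:cutoff], Y[:, cutoff:end]
    t, c = treated == 1, treated == 0
    return ((post[t].mean() - pre[t].mean())
            - (post[c].mean() - pre[c].mean()))

# Simulated total GMV: 40 geos, 12 weeks, first 20 geos launch at week 8.
rng = np.random.default_rng(4)
n_geos, n_weeks, launch_week, true_effect = 40, 12, 8, 5.0
treated = (np.arange(n_geos) < 20).astype(int)
post = (np.arange(n_weeks) >= launch_week).astype(float)
Y = (rng.normal(100, 10, (n_geos, 1))          # geo-level baseline
     + rng.normal(0, 3, (1, n_weeks))          # shared seasonality
     + true_effect * np.outer(treated, post)
     + rng.normal(0, 0.5, (n_geos, n_weeks)))

real_estimate = did_from_panel(Y, treated, 0, launch_week, n_weeks)
placebo_estimate = did_from_panel(Y, treated, 0, 4, launch_week)  # fake launch at week 4
print(round(real_estimate, 2), round(placebo_estimate, 2))
```

A placebo estimate far from zero suggests the groups were already diverging before launch, which undermines the parallel-trends argument for the real estimate.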

They should close with a decision-oriented statement: if total GMV does not move while source GMV rises and other sources fall by a similar amount, the evidence supports cannibalization; if total GMV rises with stable guardrails, the source is likely incremental. If they had more time, they would add segment analysis by user intent, category, and acquisition channel, plus sensitivity checks for attribution windows and spillover across geos.

A second angle

For “Evaluate brand ads effectiveness on social media causally”, the same causal logic applies, but the outcome is often harder to observe and the treatment is advertising exposure rather than a product launch. A user-level comparison of exposed and unexposed users is biased when ad delivery is optimized toward likely converters, so a geo or matched-market lift test can better preserve exogenous variation. The primary outcome may be brand_awareness, ad_recall, search lift, site visits, or conversions, with survey-based outcomes requiring careful weighting and nonresponse checks. Instead of asking whether one source cannibalized another, you ask whether exposed markets improved more than comparable unexposed markets after controlling for seasonality, baseline brand strength, and concurrent campaigns. The candidate should emphasize incrementality over attributed conversions, because last-click or view-through attribution credits ads for conversions that would have happened anyway and can substantially overstate causal ad value.

Common pitfalls

Pitfall: Treating before/after change as causal.

A tempting answer is “revenue increased 8% after launch, so the product worked.” That ignores seasonality, macro trends, Meta-wide product changes, marketing campaigns, and regression to the mean. A better answer compares treated units to credible controls and explains why the counterfactual trend is plausible.

Pitfall: Overclaiming from a sophisticated model.

Using propensity scores, synthetic control, or a fixed-effects regression does not automatically make an estimate causal. Interviewers will push on assumptions: Are there pre-trend differences? Did treatment timing respond to expected growth? Are controls contaminated? Strong candidates state what each method adjusts for and what it cannot fix.

Pitfall: Communicating only equations, not a launch recommendation.

For Meta DS interviews, the deliverable is usually a product or business decision. Do not stop at $\beta=0.03$, $p<0.05$; translate the estimate into incremental revenue, a confidence interval, downside risk, segment heterogeneity, and guardrail impact. The best answer says what evidence would justify launch, rollback, or a follow-up experiment.
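As a rough sketch of that translation, suppose the outcome was modeled in logs, so the coefficient of 0.03 is a log-point lift. The standard error, the baseline revenue, and the function name are all hypothetical values introduced for illustration, and the interval is a simple normal approximation.

```python
import math

def lift_summary(beta, se, baseline_revenue, z=1.96):
    """Convert a log-outcome coefficient into percent and revenue lift."""
    def to_pct(b):
        return math.exp(b) - 1.0

    pct = to_pct(beta)
    pct_low, pct_high = to_pct(beta - z * se), to_pct(beta + z * se)
    return {
        "pct_lift": pct,
        "pct_ci": (pct_low, pct_high),
        "incremental_revenue": baseline_revenue * pct,
        "revenue_ci": (baseline_revenue * pct_low, baseline_revenue * pct_high),
    }

# Hypothetical inputs: beta from the text, an assumed standard error of 0.012,
# and an assumed baseline of 10,000,000 in revenue for the treated population.
summary = lift_summary(beta=0.03, se=0.012, baseline_revenue=10_000_000)
print(f"lift: {summary['pct_lift']:.2%}")
print(f"incremental revenue: {summary['incremental_revenue']:,.0f}")
print(f"95% CI: {summary['revenue_ci'][0]:,.0f} to {summary['revenue_ci'][1]:,.0f}")
```

Reporting the lower bound alongside the point estimate gives the decision-maker the downside case directly, which is usually more useful than the p-value.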

Connections

Interviewers often pivot from quasi-experiments to A/B testing, power analysis, metric design, selection bias, attribution, and ranking evaluation. Be ready to explain when a randomized experiment is superior, when it is infeasible, and how observational estimates can be used as directional evidence before a cleaner test. For commerce and ads cases, expect follow-ups on incrementality, cannibalization, marketplace equilibrium, and heterogeneous treatment effects.
