Quasi-Experimental Designs: Instrumental Variables And Regression Discontinuity
Asked of: Data Scientist
What's being tested
Interviewers are probing whether you can make causal claims when a clean randomized experiment is unavailable, unethical, underpowered, or contaminated by user self-selection. For a Meta Data Scientist, this matters because many product questions involve opt-in features, eligibility thresholds, policy constraints, ranking cutoffs, or staggered launches where naive metric comparisons are biased. You need to know when instrumental variables and regression discontinuity are credible, what assumptions they require, how to validate those assumptions with product data, and how to communicate a limited-but-useful causal estimate. The strongest answers separate “can estimate an association” from “can identify a causal effect,” then propose diagnostics, robustness checks, and decision-relevant metrics like DAU, sessions/user, time_spent, retention, or revenue.
Core knowledge
- Instrumental variables estimate causal effects when treatment uptake is endogenous, such as users choosing whether to enable notifications or creators choosing whether to adopt a monetization tool. A valid instrument affects treatment but does not directly affect the outcome except through treatment.
- The core IV assumptions are relevance, exclusion restriction, independence, and monotonicity. Writing $Z$ for the instrument, $D$ for treatment, and $Y$ for the outcome: relevance means $\operatorname{Cov}(Z, D) \neq 0$; exclusion means $Z$ affects $Y$ only through $D$; independence means $Z$ is as-if random; monotonicity means no “defiers” who do the opposite of assignment.
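- The simple Wald estimator is:

  $$\hat{\beta}_{\text{IV}} = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]}$$

  where $Z$ is the instrument, $D$ is treatment, and $Y$ is the outcome. It estimates the local average treatment effect for compliers, not necessarily all users.
- Two-stage least squares is the standard IV implementation. First stage: predict treatment from the instrument, $D_i = \pi_0 + \pi_1 Z_i + v_i$. Second stage: regress outcome on predicted treatment, $Y_i = \beta_0 + \beta_1 \hat{D}_i + \varepsilon_i$. Use an IV routine rather than two manual OLS fits, because a hand-run second stage reports incorrect standard errors, and use robust or clustered standard errors when observations are correlated.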
- A weak instrument creates unstable estimates that, in finite samples, are biased toward the confounded OLS estimate. A common rule of thumb is a first-stage F-statistic greater than about 10 for a single instrument, though stronger is better. In product settings, weak instruments often arise when eligibility or assignment barely changes actual feature usage.
- Regression discontinuity applies when treatment changes sharply at a threshold in a running variable, such as age, score, risk rating, quality rank, friend-count tier, or eligibility cutoff. Units just above and below the cutoff are treated as locally comparable, creating quasi-random variation.
- In a sharp RD, treatment changes deterministically at the cutoff: $D_i = \mathbf{1}[X_i \geq c]$, where $X_i$ is the running variable and $c$ is the cutoff. In a fuzzy RD, the probability of treatment jumps at the cutoff but not from 0 to 1, so the cutoff acts like an instrument and the estimand is a local treatment effect near the threshold.
- RD estimates are local. If a creator tool is available only above 10,000 followers, the estimated effect applies to creators near 10,000 followers, not celebrities with 10 million followers or new creators with 100 followers. This limitation should be stated explicitly.
- The key RD design choices are bandwidth, functional form, and kernel weighting. Narrow bandwidth improves comparability but increases variance; wide bandwidth improves power but risks bias. A strong answer proposes sensitivity checks across bandwidths and local linear regressions on both sides of the cutoff.
- RD validity depends on no precise manipulation of the running variable. If users or creators can game the threshold, estimates may be biased. Check density around the cutoff using a McCrary density test, inspect bunching, and test whether pre-treatment covariates are smooth across the cutoff.
- Good quasi-experimental analysis uses placebo tests and negative controls. For RD, test fake cutoffs where no treatment change occurred. For IV, test whether the instrument predicts pre-treatment outcomes or covariates it should not affect.
- Both IV and RD should start with a causal graph or clear identification story. Controls should improve precision or adjust residual imbalance, not “fix” a broken design. If the exclusion restriction or cutoff continuity assumption is not believable, no amount of regression sophistication saves the causal claim.
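To make the Wald and 2SLS ideas above concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the data-generating process, the parameter values, and the variable names (`u`, `z`, `d`, `y`) are hypothetical and not drawn from any real product dataset. It compares a naive regression of outcome on treatment with the Wald ratio and a hand-computed 2SLS point estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Simulated data; all names and parameter values are illustrative assumptions.
u = rng.normal(size=n)              # unobserved engagement propensity (confounder)
z = rng.integers(0, 2, size=n)      # binary instrument, randomly assigned
# Treatment uptake depends on both the instrument and the confounder.
d = (0.8 * z + 0.5 * u + rng.normal(size=n) > 0.5).astype(float)
true_effect = 2.0                   # constant treatment effect, so LATE equals 2.0
y = true_effect * d + 1.5 * u + rng.normal(size=n)


def ols(X, target):
    """Least-squares coefficients for target ~ X (X already contains an intercept)."""
    return np.linalg.lstsq(X, target, rcond=None)[0]


ones = np.ones(n)

# Naive comparison: regress Y on D. Biased because u drives both D and Y.
naive = ols(np.column_stack([ones, d]), y)[1]

# Wald estimator: reduced-form difference divided by first-stage difference.
reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
wald = reduced_form / first_stage

# 2SLS by hand, point estimate only. Standard errors from this manual
# second stage would be wrong, so they are deliberately not computed here.
Z = np.column_stack([ones, z])
d_hat = Z @ ols(Z, d)
tsls = ols(np.column_stack([ones, d_hat]), y)[1]

print(f"naive OLS: {naive:.3f}")
print(f"Wald:      {wald:.3f}")
print(f"2SLS:      {tsls:.3f}")
```

With a single binary instrument and no covariates, the Wald ratio and the 2SLS point estimate coincide, which is a useful sanity check to mention in an interview. In practice you would use a dedicated IV routine to get valid standard errors.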
Worked example
Estimating the impact of notification opt-in when users self-select into treatment
A strong candidate would first clarify the treatment, outcome, and decision: “Are we estimating the effect of receiving notifications on 7-day retention, sessions/user, or time_spent, and is the goal to decide whether to expand prompts or change notification ranking?” They would immediately flag that comparing opt-in users to non-opt-in users is biased because users who opt in are likely more engaged, more tolerant of notifications, or different in privacy preferences. The answer can be organized around four pillars: define the causal estimand, identify a plausible instrument, validate assumptions, and quantify robustness.
A plausible instrument might be randomized exposure to an opt-in prompt, eligibility for a notification permission surface, or platform-level variation in whether a user is shown the prompt. The first-stage check is whether the instrument meaningfully changes actual notification opt-in or notification receipt. The exclusion restriction is the hard part: if the prompt itself increases engagement by reminding users about the app, then it affects outcomes outside the treatment path and is not a clean instrument. The candidate should explain that the IV estimate is a local average treatment effect for users induced to opt in by the prompt, not all Meta users.
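The first-stage check described above can be sketched as follows. This is an illustrative example under stated assumptions: `prompt_shown` and the opt-in rates are simulated and hypothetical, and the F-statistic is the simple homoskedastic version for a single instrument (the squared first-stage t-statistic), not a robust or clustered one.

```python
import numpy as np


def first_stage_f(z, d):
    """Return (first-stage slope, homoskedastic F) for a single instrument z."""
    X = np.column_stack([np.ones(len(z)), z])
    beta, *_ = np.linalg.lstsq(X, d, rcond=None)
    resid = d - X @ beta
    n_obs, k = X.shape
    sigma2 = resid @ resid / (n_obs - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stat = beta[1] / np.sqrt(cov[1, 1])
    # With one instrument, the first-stage F equals the squared t-statistic.
    return beta[1], t_stat ** 2


rng = np.random.default_rng(1)
n = 20_000
prompt_shown = rng.integers(0, 2, size=n).astype(float)  # hypothetical instrument

# Hypothetical strong case: the prompt lifts opt-in from 20% to 45%.
opt_in_strong = (rng.random(n) < 0.20 + 0.25 * prompt_shown).astype(float)
# Hypothetical weak case: the prompt lifts opt-in from 20% to 20.2%.
opt_in_weak = (rng.random(n) < 0.20 + 0.002 * prompt_shown).astype(float)

slope_strong, f_strong = first_stage_f(prompt_shown, opt_in_strong)
slope_weak, f_weak = first_stage_f(prompt_shown, opt_in_weak)

print(f"strong: slope={slope_strong:.3f}, F={f_strong:.1f}")
print(f"weak:   slope={slope_weak:.3f}, F={f_weak:.1f}")
```

The weak case illustrates why the Wald ratio becomes unstable: the denominator is close to zero, so small sampling noise in the reduced form is magnified in the IV estimate.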
One explicit tradeoff is interpretability versus credibility: a broad observational model may cover the full population but be confounded, while IV gives a narrower estimate with stronger causal interpretation if assumptions hold. The candidate should close by saying they would run covariate balance checks across instrument groups, placebo tests on pre-period engagement, sensitivity analyses by geography/device cohort, and compare the IV estimate to any available randomized experiment or holdout. If they had more time, they would examine heterogeneous effects, such as new versus long-tenured users, because notification impact can differ sharply by lifecycle stage.
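A covariate balance check across instrument groups might look like the sketch below. The covariates (`tenure_days`, `pre_sessions`), both instruments, and the data are simulated assumptions. The 0.1 threshold on the absolute standardized mean difference is a common convention rather than a rule from this guide.

```python
import numpy as np


def standardized_mean_diff(x, z):
    """Difference in means of x between z == 1 and z == 0, in pooled-SD units."""
    x1, x0 = x[z == 1], x[z == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2.0)
    return (x1.mean() - x0.mean()) / pooled_sd


rng = np.random.default_rng(2)
n = 50_000

# Hypothetical pre-treatment covariates.
tenure_days = rng.gamma(shape=2.0, scale=200.0, size=n)
pre_sessions = rng.poisson(lam=3.0 + tenure_days / 200.0).astype(float)

# Instrument A: randomized prompt exposure, should be balanced.
z_random = rng.integers(0, 2, size=n)
# Instrument B: exposure targeted at long-tenured users, should be imbalanced.
z_targeted = (tenure_days + rng.normal(scale=200.0, size=n) > 400.0).astype(int)

covariates = {"tenure_days": tenure_days, "pre_sessions": pre_sessions}
balance = {
    name: (standardized_mean_diff(x, z_random), standardized_mean_diff(x, z_targeted))
    for name, x in covariates.items()
}

for name, (smd_random, smd_targeted) in balance.items():
    print(f"{name}: randomized SMD={smd_random:+.3f}, targeted SMD={smd_targeted:+.3f}")
```

If the instrument predicts pre-period engagement, as the targeted exposure does here, the independence assumption is in doubt and the IV estimate should not be read as causal without further adjustment.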
A second angle
Measuring the effect of a creator feature available above a follower-count threshold
This is an RD framing rather than an IV-first framing because treatment eligibility changes at a known cutoff, such as creators with at least 10,000 followers gaining access to a monetization or analytics tool. The candidate should define the running variable as follower count, the cutoff as 10,000, and the outcome as something like creator_posts/week, reels_uploads, earnings, or creator_retention. The main identification claim is that creators just below and just above 10,000 followers are comparable except for access to the feature. The answer should include checks for manipulation, because creators may campaign to cross the threshold or Meta ranking systems may amplify users near it. Unlike the notification IV example, the estimate is explicitly local to creators around the cutoff and may not generalize to very small or very large creators.
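A minimal sharp-RD sketch for this setup is shown below, assuming simulated data. The follower distribution, outcome model, jump size, and bandwidth grid are all hypothetical choices for illustration. It fits a local linear regression with triangular kernel weights on each side of the cutoff, repeats the fit across bandwidths, and runs one placebo cutoff where no treatment change occurs.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
cutoff = 10_000

# Simulated running variable and outcome; all values are illustrative assumptions.
followers = rng.uniform(5_000, 15_000, size=n)
x = followers - cutoff                      # running variable centered at the cutoff
treated = (x >= 0).astype(float)            # sharp RD: access switches on at the cutoff
true_jump = 0.5
y = 3.0 + 0.0004 * x + 1e-8 * x**2 + true_jump * treated + rng.normal(size=n)


def rd_local_linear(x_centered, outcome, bandwidth):
    """Local linear RD estimate with triangular kernel weights.

    Returns (estimated jump at zero, number of observations in the window).
    """
    keep = np.abs(x_centered) <= bandwidth
    xs, ys = x_centered[keep], outcome[keep]
    t = (xs >= 0).astype(float)
    w = 1.0 - np.abs(xs) / bandwidth        # triangular kernel: more weight near cutoff
    # Separate intercepts and slopes on each side; the coefficient on t is the jump.
    X = np.column_stack([np.ones_like(xs), t, xs, t * xs])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], ys * sw, rcond=None)
    return beta[1], int(keep.sum())


# Bandwidth sensitivity: the estimate should be reasonably stable across windows.
estimates = {h: rd_local_linear(x, y, h) for h in (500, 1_000, 2_000, 4_000)}
for h, (jump, n_window) in estimates.items():
    print(f"bandwidth={h:>5}: jump={jump:.3f} (n={n_window})")

# Placebo cutoff at 8,000 followers, using only creators below the real cutoff.
below = x < 0
placebo, _ = rd_local_linear(x[below] + 2_000, y[below], 1_000)
print(f"placebo jump at 8,000 followers: {placebo:.3f}")
```

A stable estimate across bandwidths and a placebo jump near zero support the design, but they do not rule out manipulation of the running variable, so the density and covariate-smoothness checks still apply.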
Common pitfalls
Pitfall: Treating quasi-experimental designs as “basically A/B tests.”
IV and RD require stronger explanation because the identifying assumptions are not guaranteed by design in the same way randomization is. A tempting but weak answer is, “We can just compare users above and below the cutoff”; a stronger answer says, “We can compare users locally around the cutoff after checking continuity, manipulation, and bandwidth sensitivity.”
Pitfall: Ignoring the estimand.
A common analytical mistake is reporting “the effect of the feature” when the estimate is actually a local effect. IV estimates effects for compliers; RD estimates effects near the threshold. Meta interviewers will expect you to say who the estimate applies to and whether that population matches the product decision.
Pitfall: Over-indexing on formulas without defending assumptions.
Knowing 2SLS or local linear regression is useful, but the interview is not a statistics exam. The better signal is whether you can explain why the instrument is credible, why the cutoff is not manipulated, what placebo tests you would run, and what decision risk remains if assumptions are only partially believable.
Connections
Interviewers may pivot from this topic into difference-in-differences, synthetic controls, propensity score methods, or standard A/B testing when asking how you would triangulate evidence. They may also ask about metric selection, heterogeneous treatment effects, power, guardrail metrics, or how to communicate uncertainty to product and engineering partners.
Further reading
- Angrist and Pischke, Mostly Harmless Econometrics: canonical treatment of IV, RD, and practical causal identification.
- Cunningham, Causal Inference: The Mixtape: accessible applied explanations with examples and diagnostics.
- Imbens and Lemieux, “Regression Discontinuity Designs: A Guide to Practice”: practical reference for RD assumptions, bandwidths, and validity checks.
Related concepts
- Difference-In-Differences And Quasi-Experiments (Analytics & Experimentation)
- Causal Inference And Quasi-Experiments (Analytics & Experimentation)
- Propensity Score Matching, DiD And Causal Inference (Statistics & Math)
- Difference-In-Differences (Statistics & Math)
- Causal Inference And Difference-In-Differences (Analytics & Experimentation)
- Difference-In-Differences And Staggered Rollouts (Statistics & Math)