Causal Inference And Quasi-Experiments

What's being tested

Causal inference and quasi-experimental reasoning test whether you can estimate “what would have happened otherwise” when Pinterest cannot run a clean randomized experiment, or when an A/B test has complications. Interviewers are probing whether you can separate correlation from causation, define a credible counterfactual, choose metrics like CTR, saves, repins, session_length, or long_clicks, and explain assumptions clearly. Pinterest cares because feed ranking, home surface changes, and video-pin launches can affect user engagement, creator distribution, and long-term retention in ways that are hard to evaluate from raw metric movement alone. A strong Data Scientist answer combines statistical identification, experiment hygiene, metric interpretation, and product-aware diagnostics.

Core knowledge

Randomized controlled trials are the gold standard because treatment assignment is independent of potential outcomes: $T \perp (Y(1), Y(0))$. For feed or homepage changes, define the unit of randomization carefully: user-level randomization avoids cross-session contamination, while item-level randomization can create interference.
Average treatment effect is usually framed as $ATE = E[Y(1) - Y(0)]$, but in product experiments you often estimate an intent-to-treat effect: the impact of assignment, not necessarily of exposure. This matters when only some users assigned to a new video-pin module actually see it.
Difference-in-differences estimates causal impact using pre/post changes between treated and comparison groups:
$\hat{\tau}_{DiD} = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})$
Its key assumption is parallel trends: absent treatment, groups would have moved similarly.
Synthetic control builds a weighted combination of untreated groups to match the treated unit’s pre-period trajectory. It is useful when one surface, geography, cohort, or platform receives a homepage change. It needs enough pre-period data and a donor pool not affected by spillovers.
Interrupted time series can be used when there is no control group, modeling a level or slope break at launch time. It is weaker than DiD because, without a comparison group, it must assume that nothing else changed at the same time: no simultaneous shocks, seasonality shifts, logging changes, marketing campaigns, or ranking updates that affected the same metrics.
Propensity score methods estimate treatment probability $e(X)=P(T=1|X)$ from observed covariates, then use matching, stratification, or inverse probability weighting. They help with observational exposure, but only adjust for observed confounders; unmeasured intent or creator quality can still bias estimates.
Regression adjustment uses models like linear regression, logistic regression, or doubly robust estimators to control for pre-treatment covariates such as historical sessions_per_user, prior saves, country, device, and tenure. Never control for post-treatment variables like post-launch engagement path or exposure depth: they are themselves affected by treatment, so conditioning on them biases the estimate.
CUPED reduces variance in A/B tests using a pre-treatment covariate $X$: $Y_{adj}=Y-\theta(X-\bar{X})$, where $\theta = Cov(Y,X)/Var(X)$. The variance falls by a factor of roughly $1-\rho^2$, where $\rho$ is the correlation between $X$ and $Y$, so it is most powerful for Pinterest metrics with strong user-level autocorrelation, where historical engagement or save propensity predicts the in-experiment outcome well.
Power and MDE connect sample size to detectable lift. For a difference in means, roughly $n \propto \sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2/\delta^2$, where $\sigma^2$ is the outcome variance and $\delta$ is the minimum detectable effect, so halving $\delta$ roughly quadruples the required sample. For low-base-rate metrics like video_save_rate, small relative lifts may require long runtimes or aggregated user-level outcomes.
Experiment diagnostics are not optional: check sample ratio mismatch, pre-period balance, missing metric events, exposure imbalance, novelty effects, day-of-week seasonality, and guardrail regressions. A statistically significant lift is not trustworthy if assignment, logging, or eligibility was biased.
Metric design should separate primary success metrics, guardrails, and diagnostics. For Pinterest, a launch might optimize saves_per_user or long_click_rate, treat hide_rate, unfollow_rate, latency, and 7d_retention as guardrails, and inspect diagnostics by platform, country, content type, and user tenure.
Uncertainty quantification should match the design. Use user-level standard errors for user-randomized experiments, cluster-robust errors for geo or creator-level treatments, bootstrap for complex ratio metrics, and multiple-testing corrections such as Bonferroni or Benjamini-Hochberg when slicing many cohorts.
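
To make the CUPED formula above concrete, here is a minimal sketch using only the Python standard library on synthetic data. The data-generating process, the lift of 0.5, and the helper names `cuped_adjust` and `diff_in_means` are illustrative assumptions, not Pinterest data or tooling.

```python
import random
import statistics


def cuped_adjust(y, x):
    """Return Y_adj = Y - theta * (X - mean(X)), with theta = Cov(Y, X) / Var(X)."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


def diff_in_means(values, treated):
    """Treatment-minus-control difference in mean outcome."""
    t = [v for v, flag in zip(values, treated) if flag]
    c = [v for v, flag in zip(values, treated) if not flag]
    return statistics.fmean(t) - statistics.fmean(c)


# Synthetic data (assumption): x is a pre-period metric, y the in-experiment metric.
rng = random.Random(0)
n = 2000
true_lift = 0.5
x = [rng.gauss(10, 3) for _ in range(n)]
treated = [rng.random() < 0.5 for _ in range(n)]
y = [0.8 * xi + rng.gauss(0, 1) + (true_lift if t else 0.0) for xi, t in zip(x, treated)]

# theta is estimated on pooled data, so the adjustment does not depend on the arm.
y_adj = cuped_adjust(y, x)

raw_effect = diff_in_means(y, treated)
cuped_effect = diff_in_means(y_adj, treated)
variance_ratio = statistics.variance(y_adj) / statistics.variance(y)

print(f"raw effect:     {raw_effect:.3f}")
print(f"CUPED effect:   {cuped_effect:.3f}")
print(f"variance ratio: {variance_ratio:.3f}")
```

Because the covariate is measured before assignment, the adjustment leaves the expected treatment effect unchanged and only shrinks the noise around it. In this simulation the pre-period metric explains most of the outcome variance, so the variance ratio comes out well below 1.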

Worked example

For the question “Investigate Homepage Experiment Without Control Group: Methods and Metrics,” a strong candidate starts by saying: “I’d first clarify what changed, when it launched, who was eligible, whether rollout was staggered, and what decision we need to make: estimate causal impact, diagnose metric movement, or decide whether to keep the module.” They would define the unit of analysis, likely user-day or user-week, and distinguish assigned users from actually exposed users to avoid conditioning on treatment-induced behavior. The answer can be organized into four pillars: first, metric definition; second, causal identification strategy; third, validation and falsification; fourth, segmentation and product diagnosis.

For metrics, they might propose homepage_engaged_sessions, pin_saves_per_user, outbound_click_rate, hide_rate, and 7d_retention, with one primary metric and several guardrails. For identification, if there is no randomized control, they would look for a natural comparison: unaffected countries, platforms, tenure cohorts, or users below an eligibility threshold, then use difference-in-differences or synthetic control. If no comparison exists, they would use interrupted time series, but explicitly label it weaker and spend more time ruling out concurrent shocks. A key tradeoff is bias versus variance: a tightly matched comparison cohort may be smaller and noisier, while a broader comparison group has more power but weaker parallel-trends credibility. They would validate using pre-trend plots, placebo launch dates, placebo metrics that should not move, and cohort balance on historical engagement. They would close by saying: “If I had more time, I’d test robustness across multiple counterfactual methods and estimate heterogeneous effects for new versus power users, mobile versus web, and video-heavy versus shopping-heavy sessions.”
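
The difference-in-differences estimate and the placebo-launch-date check from this walkthrough can be sketched as follows. The weekly values are made up for illustration: both groups are constructed to trend upward at the same rate, with an extra 0.5 added to the treated group after launch. The function name `did_estimate` is likewise an assumption.

```python
from statistics import fmean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """(treated post - pre) minus (control post - pre), using period means."""
    treated_change = fmean(treated_post) - fmean(treated_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    return treated_change - control_change


# Synthetic weekly means of a primary metric (assumption, not real data).
# Both groups trend up by 0.2 per week; the treated group gains an extra 0.5 after launch.
treated_pre = [10.0, 10.2, 10.4, 10.6]
control_pre = [8.0, 8.2, 8.4, 8.6]
treated_post = [11.3, 11.5, 11.7, 11.9]
control_post = [8.8, 9.0, 9.2, 9.4]

naive_pre_post = fmean(treated_post) - fmean(treated_pre)
did = did_estimate(treated_pre, treated_post, control_pre, control_post)

# Placebo launch date: pretend the launch happened halfway through the pre-period.
placebo = did_estimate(treated_pre[:2], treated_pre[2:], control_pre[:2], control_pre[2:])

print(f"naive pre/post change: {naive_pre_post:.2f}")
print(f"DiD estimate:          {did:.2f}")
print(f"placebo DiD:           {placebo:.2f}")
```

On these numbers the naive pre/post change is about 1.3, the DiD estimate is about 0.5, and the placebo estimate is about zero. The gap between the first two is the shared trend that a pre/post comparison would wrongly attribute to the launch. A placebo estimate far from zero would be a warning that parallel trends does not hold.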

A second angle

For the question “Evaluate New Feed-Ranking Algorithm with A/B Testing,” the same causal logic applies, but the design should start from randomized assignment rather than observational recovery. The candidate should frame the estimand as the causal effect of assignment to the new ranking algorithm on user-level metrics like saves_per_session, long_click_rate, session_depth, and 7d_retention. The emphasis shifts toward experiment design: unit of randomization, power, MDE, ramp plan, sample ratio mismatch, novelty effects, and guardrails such as hide_rate or creator-side distribution. Quasi-experimental methods still matter if the A/B test is compromised: for example, if treatment traffic was accidentally overrepresented on iOS, regression adjustment, reweighting, or stratified analysis can help diagnose sensitivity. The strongest answer makes clear that clean randomization beats post-hoc adjustment, but good diagnostics determine whether the randomized estimate is credible.
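
Two of the design checks named here, sample ratio mismatch and sample size for a target MDE, can be sketched with the standard library alone. The function names, the 50/50 expected split, and the example counts are assumptions for illustration. The sample-size function uses the standard two-sample, equal-allocation formula for a difference in means.

```python
import math
from statistics import NormalDist


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square (1 df) p-value for observed vs. expected assignment counts."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # Survival function of a chi-square with 1 df: P(Z^2 > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def sample_size_per_arm(sigma, delta, alpha=0.05, power=0.8):
    """Per-arm n for a two-sided, equal-allocation test of a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)


# Hypothetical counts: a 1% imbalance on 100k users is already a strong SRM signal.
print(f"SRM p-value: {srm_p_value(n_control=49_500, n_treatment=50_500):.4f}")

# Hypothetical metric with standard deviation 1.0 and a target MDE of 0.1.
print(f"users per arm: {sample_size_per_arm(sigma=1.0, delta=0.1)}")
```

A very small SRM p-value points to a problem with assignment, eligibility, or logging, and it should be investigated before any metric lift is interpreted.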

Common pitfalls

Pitfall: Treating pre/post movement as causal.

A tempting weak answer is: “The homepage metric rose 4% after launch, so the treatment caused a 4% lift.” That ignores seasonality, marketing campaigns, ranking changes, creator supply shifts, and user mix changes. A better answer defines a counterfactual and explains why its assumptions are plausible or not.

Pitfall: Jumping to methods without clarifying the estimand.

Candidates often list DiD, synthetic control, propensity matching, and regression without saying what effect they are estimating. Interviewers want to hear whether you mean effect of assignment, exposure, treatment-on-treated, short-term engagement, or long-term retention. Start with the decision and metric, then choose the method.

Pitfall: Overclaiming from observational adjustment.

Propensity scores and regressions can sound rigorous, but they do not solve unobserved confounding. For example, users who see more video pins may already prefer video, so higher engagement among exposed users is not automatically causal. Strong candidates discuss sensitivity checks, negative controls, placebo tests, and the limits of the evidence.
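
The video-pin example can be made concrete with a small simulation. This is a sketch under strong assumptions: one binary confounder (`likes_video`), made-up exposure probabilities, and a true effect of 0.5. The propensity is estimated as the exposure rate within each confounder stratum, and `ipw_effect` is a hypothetical helper name.

```python
import random
from statistics import fmean

rng = random.Random(1)
n = 20000
true_effect = 0.5

# Synthetic users (assumption): video fans are both more likely to be exposed
# and more engaged regardless of exposure, which confounds the naive comparison.
rows = []
for _ in range(n):
    likes_video = rng.random() < 0.5
    exposed = rng.random() < (0.8 if likes_video else 0.2)
    outcome = 1.0 + 2.0 * likes_video + true_effect * exposed + rng.gauss(0, 1)
    rows.append((likes_video, exposed, outcome))

naive = fmean([y for _, t, y in rows if t]) - fmean([y for _, t, y in rows if not t])

# Propensity e(X) = P(T=1 | X), estimated as the exposure rate within each stratum.
propensity = {}
for v in (False, True):
    stratum = [t for lv, t, _ in rows if lv == v]
    propensity[v] = sum(stratum) / len(stratum)


def ipw_effect(rows, propensity):
    """Normalized (Hajek) inverse-probability-weighted difference in means."""
    num_t = den_t = num_c = den_c = 0.0
    for v, t, y in rows:
        e = propensity[v]
        if t:
            num_t += y / e
            den_t += 1 / e
        else:
            num_c += y / (1 - e)
            den_c += 1 / (1 - e)
    return num_t / den_t - num_c / den_c


ipw = ipw_effect(rows, propensity)
print(f"naive difference: {naive:.2f}")
print(f"IPW estimate:     {ipw:.2f}")
print(f"true effect:      {true_effect:.2f}")
```

Weighting recovers something close to the true effect only because `likes_video` is observed here. If that preference were unmeasured, the analyst would be left with the inflated naive difference and no way to correct it from the data, which is why sensitivity checks and placebo tests matter.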

Connections

Interviewers may pivot from here into A/B testing, metric design, power analysis, ranking evaluation, or heterogeneous treatment effects. They may also ask how you would combine offline model metrics like NDCG or AUC with online product metrics such as saves, CTR, and retention. Be ready to explain when a randomized experiment is necessary versus when a quasi-experiment is acceptable.
