Propensity Score Matching And Observational Causal Inference

What's being tested

Google is probing whether you can estimate causal effects from messy product or telemetry data when a clean randomized experiment is unavailable, unsafe, or delayed. The core skill is separating correlation from causation by defining the unit of analysis, treatment, outcome, counterfactual, and likely confounders before choosing a method like propensity score matching, difference-in-differences, or controlled experimentation. Interviewers want to see that you understand both statistical identification and practical product measurement: a method is only credible if assumptions, diagnostics, and limitations are explicit. For a Data Scientist, the expected contribution is not building data pipelines, but designing the analysis, validating assumptions, interpreting uncertainty, and communicating whether the result should influence product decisions.

Core knowledge

Propensity score matching estimates causal effects by matching treated and untreated units that have a similar estimated probability of receiving treatment: $e(X)=P(T=1 \mid X)$. It is useful when treatment assignment is observational but explainable by measured covariates such as geography, device, tenure, baseline usage, or prior performance.
The central identification assumption is conditional exchangeability: $(Y(1),Y(0)) \perp T \mid X$. In plain English, after conditioning on observed covariates, treated and control units are comparable. This fails if important unobserved factors, such as user intent or local outages, drive both treatment and outcome.
Overlap, also called positivity, requires $0 < P(T=1 \mid X) < 1$ for relevant units. If all high-end devices get a fast experience and all low-end devices get a slow one, no low-end device is ever observed with the fast experience, so matching cannot estimate the effect of low latency on low-end devices without extrapolation.
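A rough way to inspect overlap is to compare the range of estimated propensity scores in each group. The sketch below is illustrative only: the scores are made up, the helper names `common_support` and `outside_support` are hypothetical, and a min/max range check is a crude stand-in for looking at the full score distributions.

```python
def common_support(treated_ps, control_ps):
    """Return the (low, high) propensity range covered by both groups, or None."""
    low = max(min(treated_ps), min(control_ps))
    high = min(max(treated_ps), max(control_ps))
    return (low, high) if low <= high else None


def outside_support(scores, support):
    """Indices of units whose propensity score falls outside the shared range."""
    low, high = support
    return [i for i, p in enumerate(scores) if not (low <= p <= high)]


# Hypothetical estimated propensity scores.
treated_ps = [0.35, 0.55, 0.80, 0.97]
control_ps = [0.05, 0.20, 0.40, 0.75]

support = common_support(treated_ps, control_ps)
dropped_treated = outside_support(treated_ps, support)
dropped_control = outside_support(control_ps, support)
```

Here the shared range is 0.35 to 0.75, so the two highest-propensity treated units and the two lowest-propensity controls have no comparable counterpart. Any estimate that excludes them describes only the overlapping population.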
The usual estimands are ATE, ATT, and ATC. ATE is $E[Y(1)-Y(0)]$ across everyone; ATT is $E[Y(1)-Y(0)\mid T=1]$ for treated users; ATC is $E[Y(1)-Y(0)\mid T=0]$ for untreated users. Product questions often care about ATT: “what was the effect on users actually exposed?”
A standard PSM workflow is: define pre-treatment covariates, estimate $e(X)$ with logistic regression, random forest, or XGBoost, match treated to controls, check balance, estimate the outcome difference, and compute uncertainty. The propensity model is a balancing tool, not the causal model itself.
Balance diagnostics matter more than propensity model accuracy. Use standardized mean differences:
$SMD=\frac{\bar X_T-\bar X_C}{\sqrt{(s_T^2+s_C^2)/2}}$
A common target is absolute SMD < 0.1 after matching. Also inspect propensity overlap plots and covariate distributions.
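As a small illustration of the SMD formula, the sketch below computes it for a single covariate before and after matching. The tenure values are invented and the function name is hypothetical; sample variances are used for $s_T^2$ and $s_C^2$.

```python
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """SMD = (mean_T - mean_C) / sqrt((s_T^2 + s_C^2) / 2), using sample variances."""
    pooled_sd = ((variance(treated) + variance(control)) / 2) ** 0.5
    if pooled_sd == 0:
        raise ValueError("covariate has zero variance in both groups")
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical tenure in months for one covariate.
tenure_treated = [12, 15, 18, 20, 25]
tenure_control_raw = [3, 5, 8, 10, 14]          # all controls, before matching
tenure_control_matched = [11, 16, 17, 21, 24]   # matched controls only

smd_before = standardized_mean_difference(tenure_treated, tenure_control_raw)
smd_after = standardized_mean_difference(tenure_treated, tenure_control_matched)
```

In this toy data the SMD falls from roughly 2.16 before matching to roughly 0.04 after, which would pass the 0.1 threshold for this one covariate. In practice the check is repeated for every covariate in $X$.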
Common matching choices include nearest-neighbor matching, caliper matching, exact matching on critical variables, and stratification by propensity score bins. Calipers such as 0.2 standard deviations of the logit propensity are often used to avoid bad matches, at the cost of discarding treated units that have no control within the caliper, which changes the population the estimate describes.
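Nearest-neighbor matching with a caliper can be sketched in a few lines. This is one simple variant under stated assumptions, not a reference implementation: it is greedy, 1:1, without replacement, processes treated units in input order, and sets the caliper from the standard deviation of the logit propensity across all units. The scores and the function name `caliper_match` are hypothetical.

```python
import math
from statistics import pstdev


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def caliper_match(treated_ps, control_ps, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbor matching without replacement on the logit propensity.

    Returns (treated_index, control_index) pairs. Treated units with no
    available control inside the caliper are left unmatched.
    """
    t_logit = [logit(p) for p in treated_ps]
    c_logit = [logit(p) for p in control_ps]
    caliper = caliper_sd * pstdev(t_logit + c_logit)
    available = set(range(len(c_logit)))
    pairs = []
    for i, lt in enumerate(t_logit):
        if not available:
            break
        # sorted() makes tie-breaking deterministic (lowest control index wins).
        j = min(sorted(available), key=lambda k: abs(c_logit[k] - lt))
        if abs(c_logit[j] - lt) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


# Hypothetical estimated propensity scores.
treated_ps = [0.62, 0.70, 0.95]
control_ps = [0.30, 0.60, 0.72, 0.68]
pairs = caliper_match(treated_ps, control_ps)
```

The treated unit with a score of 0.95 stays unmatched because no control is within the caliper, which is exactly the trade-off described above. Greedy matching also depends on processing order; optimal matching is an alternative when that matters.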
Inverse probability weighting is a close alternative: weight treated units by $1/e(X)$ and controls by $1/(1-e(X))$ for ATE. For ATT, treated units keep weight 1 and controls receive $e(X)/(1-e(X))$. Weighting can use more data than matching but is sensitive to extreme propensities, because scores near 0 or 1 produce very large weights.
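The weighting formulas translate directly into code. The sketch below assumes propensity scores have already been estimated and uses weight-normalized means within each group, which is one common choice rather than the only one. All data and function names are hypothetical.

```python
def ipw_weights(treatment, propensity, estimand="ATE"):
    """Per-unit weights from 0/1 treatment flags and propensity scores e(X)."""
    weights = []
    for t, e in zip(treatment, propensity):
        if estimand == "ATE":
            weights.append(1 / e if t == 1 else 1 / (1 - e))
        elif estimand == "ATT":
            weights.append(1.0 if t == 1 else e / (1 - e))
        else:
            raise ValueError("estimand must be 'ATE' or 'ATT'")
    return weights


def weighted_difference(outcome, treatment, weights):
    """Treated minus control difference of weight-normalized outcome means."""
    def wmean(group):
        num = sum(w * y for y, t, w in zip(outcome, treatment, weights) if t == group)
        den = sum(w for t, w in zip(treatment, weights) if t == group)
        return num / den
    return wmean(1) - wmean(0)


# Hypothetical units: treatment flag, estimated propensity, outcome.
treatment = [1, 1, 0, 0]
propensity = [0.8, 0.5, 0.5, 0.2]
outcome = [10, 8, 6, 5]

ate_weights = ipw_weights(treatment, propensity, "ATE")
att_weights = ipw_weights(treatment, propensity, "ATT")
ate_effect = weighted_difference(outcome, treatment, ate_weights)
att_effect = weighted_difference(outcome, treatment, att_weights)
```

With these numbers the ATE weights are 1.25, 2, 2, and 1.25, and the ATT weights are 1, 1, 1, and 0.25. A control with a propensity of 0.99 would receive an ATE weight of 100, which is why extreme scores are often trimmed or clipped before weighting.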
Always choose covariates measured before treatment exposure. Conditioning on post-treatment variables, such as session length after latency changed or reviews written after a sales spike, can introduce collider bias or block part of the causal pathway you are trying to estimate.
PSM is not a magic replacement for experimentation. It controls only measured confounding, can increase variance by dropping unmatched units, and may perform poorly in high-dimensional sparse settings. If randomization is feasible and ethical, an A/B test remains cleaner.
For time-based launches, PSM alone may be insufficient because of time trends, seasonality, and simultaneous product changes. Combine matching with difference-in-differences, synthetic controls, or interrupted time-series logic when pre/post dynamics are central, as with geo usage drops or post-update call drop rates.
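The difference-in-differences arithmetic is simple enough to show directly. The sketch below uses invented call drop rates for a treated group and a matched control group, and it relies on the usual assumption that both groups would have followed parallel trends without the change. The function name is hypothetical.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """(post minus pre change for treated) minus (post minus pre change for controls)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical call drop rates (%) for devices that took the update vs. matched devices that did not.
treated_pre = [2.0, 2.2, 2.1]
treated_post = [2.9, 3.1, 3.0]
control_pre = [1.9, 2.1, 2.0]
control_post = [2.2, 2.4, 2.3]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
```

The treated group rose by 0.9 points and the controls by 0.3, so the estimate attributes 0.6 points to the update rather than the full 0.9. The 0.3-point shared trend is what a pre/post comparison alone would have misattributed.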
Report results with uncertainty and sensitivity: confidence intervals, bootstrap standard errors, subgroup robustness, alternative calipers, and falsification checks using pre-treatment outcomes. A credible answer says, “under these assumptions, the estimated lift is X,” not “PSM proves causality.”
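One way to attach uncertainty to a matched estimate is a percentile bootstrap over matched-pair outcome differences. The sketch below is a simplification: it resamples pairs while treating the matching as fixed, so it ignores uncertainty from estimating the propensity model and from the matching step itself. The differences and the function name are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(pair_diffs, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(pair_diffs)
    boot_means = sorted(mean(rng.choices(pair_diffs, k=n)) for _ in range(n_boot))
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean(pair_diffs), lower, upper


# Hypothetical treated-minus-control outcome differences for ten matched pairs.
pair_diffs = [0.8, 1.2, -0.3, 0.5, 1.0, 0.7, 0.2, 1.4, 0.9, 0.6]
estimate, ci_low, ci_high = bootstrap_ci(pair_diffs)
```

The result would be reported as an estimate with its interval, conditional on the identification assumptions, and rerun under alternative calipers or covariate sets to see how stable it is.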

Worked example

For “Design tests to measure latency impact,” a strong candidate should first clarify the user population, the unit of analysis, and whether latency variation is randomized, naturally occurring, or caused by rollout rules. In the first 30 seconds, say: “I’d define treatment as exposure to higher page or API latency during a session, outcome as downstream engagement such as CTR, conversion_rate, watch_time, or abandonment, and I’d separate short-term session effects from user-level retention.” The answer should then organize around four pillars: measurement definition, experimental design if possible, observational causal design if randomization is not possible, and diagnostics/decision criteria.

The cleanest design is a controlled latency injection or traffic-splitting experiment, with guardrails such as p95_latency, error rate, and user harm thresholds. If intentionally slowing users is unsafe or unacceptable, use observational variation: match high-latency sessions to low-latency sessions on pre-treatment covariates like geo, device, browser, network type, time of day, prior engagement, and page type. Estimate a propensity score for receiving high latency, perform caliper matching or weighting, verify covariate balance, and compare outcomes with confidence intervals.

A key tradeoff is session-level versus user-level analysis. Session-level gives more observations but can violate independence because heavy users contribute many sessions; user-level aggregation reduces dependence but may hide acute latency effects. A strong candidate would close by saying: “If I had more time, I’d add heterogeneity analysis by device/network segment, placebo tests on pre-latency outcomes, and compare PSM estimates with difference-in-differences around known latency incidents.”
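The session-level versus user-level distinction can change the headline number, not just the standard error. The sketch below uses invented abandonment flags in which one heavy user contributes most of the sessions. The helper `user_level_means` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def user_level_means(sessions):
    """Collapse (user_id, outcome) session rows to one mean outcome per user."""
    by_user = defaultdict(list)
    for user_id, outcome in sessions:
        by_user[user_id].append(outcome)
    return {user_id: mean(values) for user_id, values in by_user.items()}


# Hypothetical sessions: (user_id, abandoned flag). User u1 is a heavy user.
sessions = [("u1", 1), ("u1", 1), ("u1", 1), ("u1", 0), ("u2", 0), ("u3", 0)]

session_rate = mean([outcome for _, outcome in sessions])
per_user = user_level_means(sessions)
user_rate = mean(list(per_user.values()))
```

The session-level abandonment rate is 0.5 while the user-level rate is 0.25, because user u1 supplies four of the six sessions. If the analysis stays at session level, uncertainty should account for that within-user dependence, for example by resampling users rather than sessions.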

A second angle

For “Establish causality: commute playlist and driving speed,” the same causal logic applies, but safety and confounding dominate the framing. Treatment is listening to a commute playlist, and the outcome might be average speed, hard braking, or speeding events; the unit could be trip, driver-day, or driver. A naive comparison between playlist listeners and non-listeners is confounded by route, commute time, driver personality, traffic, weather, and baseline driving behavior. PSM could match playlist trips to non-playlist trips on pre-trip and contextual variables, but the candidate should be cautious: unobserved mood or urgency may still bias results. A randomized recommendation or encouragement design would be more credible, but any experiment must include safety guardrails and avoid inducing risky driving.

Common pitfalls

Pitfall: Treating matching as proof of causality.

A tempting answer is, “I’ll match treated and control users, compare outcomes, and conclude the treatment caused the lift.” That skips the identification assumptions. A stronger answer explicitly says PSM adjusts for observed confounders only, then checks overlap, balance, robustness, and whether unobserved confounding is plausible.

Pitfall: Matching on variables affected by the treatment.

For example, when estimating whether customer reviews affect sales, do not match on post-review traffic, ranking position after reviews changed, or conversion after reviews were visible. Those variables may be mediators or colliders. Use pre-treatment covariates such as historical sales, category, price, brand, baseline rating, seasonality, and prior traffic.

Pitfall: Overcommunicating the method and undercommunicating the decision.

Interviewers do not just want a list of causal techniques. They want to know whether the evidence is strong enough to launch, roll back, investigate, or run a follow-up experiment. Translate the estimate into product language: expected impact on DAU, revenue, retention, or safety metrics, plus uncertainty and caveats.

Connections

Interviewers may pivot from PSM to A/B testing, difference-in-differences, synthetic control, instrumental variables, or regression discontinuity depending on whether treatment was randomized, staggered, threshold-based, or naturally assigned. They may also ask about metric design, variance reduction such as CUPED, ratio metric inference, or heterogeneous treatment effects across geos, devices, and cohorts.
