Propensity Score Matching And Observational Causal Inference
Asked of: Data Scientist

What's being tested
Google is probing whether you can estimate causal effects from messy product or telemetry data when a clean randomized experiment is unavailable, unsafe, or delayed. The core skill is separating correlation from causation by defining the unit of analysis, treatment, outcome, counterfactual, and likely confounders before choosing a method like propensity score matching, difference-in-differences, or controlled experimentation. Interviewers want to see that you understand both statistical identification and practical product measurement: a method is only credible if assumptions, diagnostics, and limitations are explicit. For a Data Scientist, the expected contribution is not building data pipelines, but designing the analysis, validating assumptions, interpreting uncertainty, and communicating whether the result should influence product decisions.
Core knowledge
- Propensity score matching estimates causal effects by matching treated and untreated units with a similar probability of receiving treatment given their covariates: $e(X) = P(T = 1 \mid X)$. It is useful when treatment assignment is observational but explainable by measured covariates such as geography, device, tenure, baseline usage, or prior performance.
- The central identification assumption is conditional exchangeability: $(Y(1), Y(0)) \perp T \mid X$. In plain English, after conditioning on observed covariates, treated and control units are comparable. This fails if important unobserved factors, such as user intent or local outages, drive both treatment and outcome.
- Overlap, also called positivity, requires $0 < P(T = 1 \mid X) < 1$ for relevant units. If all high-end devices get a fast experience and all low-end devices get a slow one, matching cannot estimate the effect for low-end devices under fast latency without extrapolation.
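- The usual estimands are ATE, ATT, and ATC. ATE, $E[Y(1) - Y(0)]$, averages over everyone; ATT, $E[Y(1) - Y(0) \mid T = 1]$, averages over treated units; ATC, $E[Y(1) - Y(0) \mid T = 0]$, averages over untreated units. Product questions often care about ATT: “what was the effect on users actually exposed?”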
- A standard PSM workflow is: define pre-treatment covariates, estimate $e(X)$ with logistic regression, random forest, or XGBoost, match treated to controls, check balance, estimate the outcome difference, and compute uncertainty. The propensity model is a balancing tool, not the causal model itself.
- Balance diagnostics matter more than propensity model accuracy. Use standardized mean differences: $\mathrm{SMD} = (\bar{x}_T - \bar{x}_C) / \sqrt{(s_T^2 + s_C^2)/2}$, where $\bar{x}$ and $s^2$ are the covariate mean and variance in the treated ($T$) and control ($C$) groups. A common target is absolute SMD < 0.1 after matching. Also inspect propensity overlap plots and covariate distributions.
- Common matching choices include nearest-neighbor matching, caliper matching, exact matching on critical variables, and stratification by propensity score bins. Calipers such as 0.2 standard deviations of the logit propensity are often used to avoid bad matches, at the cost of discarding treated units.
- Inverse probability weighting is a close alternative: weight treated units by $1/e(X)$ and controls by $1/(1 - e(X))$ for ATE. For ATT, treated units keep weight 1 and controls often receive $e(X)/(1 - e(X))$. Weighting can use more data than matching but is sensitive to extreme propensities.
- Always choose covariates measured before treatment exposure. Conditioning on post-treatment variables, such as session length after latency changed or reviews written after a sales spike, can introduce collider bias or block part of the causal pathway you are trying to estimate.
- PSM is not a magic replacement for experimentation. It controls only measured confounding, can increase variance by dropping unmatched units, and may perform poorly in high-dimensional sparse settings. If randomization is feasible and ethical, an A/B test remains cleaner.
- For time-based launches, PSM alone may be insufficient because of time trends, seasonality, and simultaneous product changes. Combine matching with difference-in-differences, synthetic controls, or interrupted time-series logic when pre/post dynamics are central, as with geo usage drops or post-update call drop rates.
- Report results with uncertainty and sensitivity: confidence intervals, bootstrap standard errors, subgroup robustness, alternative calipers, and falsification checks using pre-treatment outcomes. A credible answer says, “under these assumptions, the estimated lift is X,” not “PSM proves causality.”
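The matching, balance, and estimation steps of the workflow can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production implementation: it assumes propensity scores have already been estimated elsewhere (for example with a logistic regression), and the function names `smd`, `caliper_match`, and `att_from_pairs` are hypothetical helpers introduced here. The matching is greedy 1:1 nearest-neighbor without replacement, processed in input order, which is only one of several reasonable choices.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sample_variance(xs):
    """Unbiased sample variance (needs at least two values)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def pooled_sd(a, b):
    """Square root of the average of the two group variances."""
    return math.sqrt((sample_variance(a) + sample_variance(b)) / 2.0)


def smd(treated, control):
    """Standardized mean difference of one covariate between groups."""
    mean_t = sum(treated) / len(treated)
    mean_c = sum(control) / len(control)
    return (mean_t - mean_c) / pooled_sd(treated, control)


def caliper_match(ps_treated, ps_control, caliper_sd=0.2):
    """Greedy 1:1 matching without replacement on the logit propensity.

    Returns (treated_index, control_index) pairs. Treated units with no
    available control within the caliper are dropped, so the estimand
    becomes the effect on the matched treated units.
    """
    logit_t = [logit(p) for p in ps_treated]
    logit_c = [logit(p) for p in ps_control]
    caliper = caliper_sd * pooled_sd(logit_t, logit_c)
    available = list(range(len(logit_c)))
    pairs = []
    for i, score in enumerate(logit_t):
        if not available:
            break
        # Closest still-unmatched control on the logit scale.
        j = min(available, key=lambda k: abs(logit_c[k] - score))
        if abs(logit_c[j] - score) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


def att_from_pairs(pairs, y_treated, y_control):
    """Mean outcome difference across matched pairs."""
    return sum(y_treated[i] - y_control[j] for i, j in pairs) / len(pairs)
```

In practice one would compute `smd` for every covariate before and after matching, and rerun `caliper_match` with alternative `caliper_sd` values to see how many treated units are discarded and whether the estimate moves.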
Worked example
For the practice question “Design tests to measure latency impact,” a strong candidate should first clarify the user population, the unit of analysis, and whether latency variation is randomized, naturally occurring, or caused by rollout rules. In the first 30 seconds, say: “I’d define treatment as exposure to higher page or API latency during a session, outcome as downstream engagement such as CTR, conversion_rate, watch_time, or abandonment, and I’d separate short-term session effects from user-level retention.” The answer should then organize around four pillars: measurement definition, experimental design if possible, observational causal design if randomization is not possible, and diagnostics/decision criteria.
The cleanest design is a controlled latency injection or traffic-splitting experiment, with guardrails such as p95_latency, error rate, and user harm thresholds. If intentionally slowing users is unsafe or unacceptable, use observational variation: match high-latency sessions to low-latency sessions on pre-treatment covariates like geo, device, browser, network type, time of day, prior engagement, and page type. Estimate a propensity score for receiving high latency, perform caliper matching or weighting, verify covariate balance, and compare outcomes with confidence intervals.
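If weighting is chosen instead of matching, the calculation might look like the sketch below. It assumes a binary high-latency indicator and pre-estimated propensity scores, uses the standard weights (for ATE, 1/e for treated and 1/(1 - e) for controls; for ATT, 1 for treated and e/(1 - e) for controls), and normalizes weights within each group. The helper names and the clipping bounds are illustrative assumptions, not part of any particular library.

```python
def ipw_weights(treatment, propensity, estimand="ATE", clip=(0.01, 0.99)):
    """Inverse probability weights for a binary treatment.

    Propensities are clipped to `clip` (an assumed, tunable choice) so
    that a few extreme scores do not dominate the estimate.
    """
    weights = []
    for t, e in zip(treatment, propensity):
        e = min(max(e, clip[0]), clip[1])
        if estimand == "ATE":
            w = 1.0 / e if t == 1 else 1.0 / (1.0 - e)
        elif estimand == "ATT":
            w = 1.0 if t == 1 else e / (1.0 - e)
        else:
            raise ValueError("estimand must be 'ATE' or 'ATT'")
        weights.append(w)
    return weights


def weighted_effect(treatment, outcome, weights):
    """Difference in weighted mean outcomes, weights normalized per group."""
    sum_wy = {0: 0.0, 1: 0.0}
    sum_w = {0: 0.0, 1: 0.0}
    for t, y, w in zip(treatment, outcome, weights):
        sum_wy[t] += w * y
        sum_w[t] += w
    return sum_wy[1] / sum_w[1] - sum_wy[0] / sum_w[0]
```

Inspecting the largest weights before trusting the estimate is a cheap way to spot the sensitivity to extreme propensities.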
A key tradeoff is session-level versus user-level analysis. Session-level gives more observations but can violate independence because heavy users contribute many sessions; user-level aggregation reduces dependence but may hide acute latency effects. A strong candidate would close by saying: “If I had more time, I’d add heterogeneity analysis by device/network segment, placebo tests on pre-latency outcomes, and compare PSM estimates with difference-in-differences around known latency incidents.”
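One way to keep session-level observations while respecting within-user dependence is to resample users rather than sessions when bootstrapping. The sketch below assumes matched-pair outcome differences have already been computed per session and grouped by user; `cluster_bootstrap_ci` and its parameters are hypothetical names, and the percentile interval is a simple choice among several bootstrap intervals.

```python
import random


def cluster_bootstrap_ci(diffs_by_user, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean session-level difference.

    `diffs_by_user` maps a user id to that user's list of matched-pair
    outcome differences. Whole users are resampled with replacement, so
    a heavy user's sessions enter or leave the sample together.
    """
    rng = random.Random(seed)
    users = list(diffs_by_user)
    estimates = []
    for _ in range(n_boot):
        sampled_users = [rng.choice(users) for _ in users]
        pooled = [d for u in sampled_users for d in diffs_by_user[u]]
        estimates.append(sum(pooled) / len(pooled))
    estimates.sort()
    lower_index = int((alpha / 2.0) * n_boot)
    upper_index = n_boot - 1 - lower_index
    return estimates[lower_index], estimates[upper_index]
```

If the user-clustered interval is much wider than a naive session-level interval, that gap is itself evidence that sessions should not be treated as independent.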
A second angle
For the practice question “Establish causality: commute playlist and driving speed,” the same causal logic applies, but safety and confounding dominate the framing. Treatment is listening to a commute playlist, and the outcome might be average speed, hard braking, or speeding events; the unit could be trip, driver-day, or driver. A naive comparison between playlist listeners and non-listeners is confounded by route, commute time, driver personality, traffic, weather, and baseline driving behavior. PSM could match playlist trips to non-playlist trips on pre-trip and contextual variables, but the candidate should be cautious: unobserved mood or urgency may still bias results. A randomized recommendation or encouragement design would be more credible, but any experiment must include safety guardrails and avoid inducing risky driving.
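If a randomized encouragement design were run, a first-pass analysis might compare outcomes by assignment and then scale by the difference in uptake. The sketch below is an illustration under stated assumptions: `assigned` is the randomized recommendation flag, `listened` is actual playlist use, and the ratio (often called the Wald estimator) is interpretable as an effect among drivers who listen because they were encouraged only if the recommendation affects driving solely through listening and does not make anyone less likely to listen. The function name is hypothetical.

```python
def encouragement_estimates(assigned, listened, outcome):
    """Return (intent-to-treat effect, uptake difference, Wald ratio)."""

    def mean(values):
        return sum(values) / len(values)

    y_enc = [y for z, y in zip(assigned, outcome) if z == 1]
    y_ctl = [y for z, y in zip(assigned, outcome) if z == 0]
    d_enc = [d for z, d in zip(assigned, listened) if z == 1]
    d_ctl = [d for z, d in zip(assigned, listened) if z == 0]

    itt = mean(y_enc) - mean(y_ctl)        # effect of being encouraged
    uptake = mean(d_enc) - mean(d_ctl)     # extra listening caused by it
    wald = itt / uptake if uptake != 0 else float("nan")
    return itt, uptake, wald
```

A small uptake difference makes the ratio unstable, so the intent-to-treat effect is usually the safer headline number.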
Common pitfalls
Pitfall: Treating matching as proof of causality.
A tempting answer is, “I’ll match treated and control users, compare outcomes, and conclude the treatment caused the lift.” That skips the identification assumptions. A stronger answer explicitly says PSM adjusts for observed confounders only, then checks overlap, balance, robustness, and whether unobserved confounding is plausible.
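The overlap check can be made concrete with a simple common-support calculation. This sketch assumes propensity scores are already available for both groups and uses the min/max rule (the region where both groups have scores) as one rough definition of common support; the helper names are hypothetical.

```python
def common_support(ps_treated, ps_control):
    """Propensity range covered by both treated and control units."""
    low = max(min(ps_treated), min(ps_control))
    high = min(max(ps_treated), max(ps_control))
    return low, high


def share_outside(scores, low, high):
    """Fraction of units whose propensity falls outside [low, high]."""
    outside = sum(1 for p in scores if p < low or p > high)
    return outside / len(scores)
```

A large share of treated units outside the common region signals that any matched estimate describes only a subset of the exposed population, which should be stated alongside the result.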
Pitfall: Matching on variables affected by the treatment.
For example, when estimating whether customer reviews affect sales, do not match on post-review traffic, ranking position after reviews changed, or conversion after reviews were visible. Those variables may be mediators or colliders. Use pre-treatment covariates such as historical sales, category, price, brand, baseline rating, seasonality, and prior traffic.
Pitfall: Overcommunicating the method and undercommunicating the decision.
Interviewers do not just want a list of causal techniques. They want to know whether the evidence is strong enough to launch, roll back, investigate, or run a follow-up experiment. Translate the estimate into product language: expected impact on DAU, revenue, retention, or safety metrics, plus uncertainty and caveats.
Connections
Interviewers may pivot from PSM to A/B testing, difference-in-differences, synthetic control, instrumental variables, or regression discontinuity depending on whether treatment was randomized, staggered, threshold-based, or naturally assigned. They may also ask about metric design, variance reduction such as CUPED, ratio metric inference, or heterogeneous treatment effects across geos, devices, and cohorts.
Further reading
- Rosenbaum and Rubin, “The Central Role of the Propensity Score in Observational Studies for Causal Effects” (1983): the foundational paper defining propensity scores and balancing properties.
- Imbens and Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences: rigorous treatment of potential outcomes, matching, weighting, and design-based causal inference.
- Austin, “An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies” (2011): practical overview of matching, weighting, balance diagnostics, and common mistakes.
Practice questions
- How would you use propensity score matching here (Google · Data Scientist · Onsite · Medium)
- Establish causality: commute playlist and driving speed (Google · Data Scientist · Technical Screen · Medium)
- Diagnose 10–11% usage drop across geos (Google · Data Scientist · Technical Screen · Medium)
- Infer causal impact without an A/B test (Google · Data Scientist · Technical Screen · Hard)
- Decide between two vendors under constraints (Google · Data Scientist · Onsite · Medium)
- Compare two stores’ profits rigorously (Google · Data Scientist · Technical Screen · Hard)
- Design tests to measure latency impact (Google · Data Scientist · Onsite · Easy)
- Analyze Call Drop Rates Pre- and Post-Update Implementation (Google · Data Scientist · Technical Screen · Medium)
- Evaluate College Impact on Income: Address Bias and Validity (Google · Data Scientist · Technical Screen · Medium)
- Analyze Impact of Customer Reviews on Sales Performance (Google · Data Scientist · Technical Screen · Medium)
Related concepts
- Propensity Score Matching, DiD And Causal Inference (Statistics & Math)
- Causal Inference, Confounding, And Matching (Analytics & Experimentation)
- Causal Inference And Identification (Statistics & Math)
- Causal Inference And Difference-In-Differences (Analytics & Experimentation)
- Propensity Score Matching (Statistics & Math)
- Difference-In-Differences And Quasi-Experiments (Analytics & Experimentation)