Propensity Score Matching

What's being tested

Interviewers are testing whether you can use propensity score matching as part of a credible observational causal study, not just describe it mechanically. For an Amazon Data Scientist, this matters because many decisions involve non-random exposure: customers receive reminders, recommendations, coupons, ads, or seller interventions based on prior behavior, which creates selection bias. The interviewer is probing whether you can define treatment and outcome cleanly, identify confounders, estimate treatment effects under explicit assumptions, diagnose balance, and explain when matching is weaker than alternatives like difference-in-differences or randomized experiments. A strong answer sounds like a causal analysis plan, not a generic ML workflow.

Core knowledge

Propensity score is the probability of receiving treatment conditional on observed covariates: $e(X)=P(T=1 \mid X)$. Matching on $e(X)$ compresses many covariates into one score, making treated and control units more comparable under the right assumptions.
Treatment effect estimands must be named before modeling. ATT estimates impact on treated users, $E[Y(1)-Y(0)\mid T=1]$; ATE estimates impact over the full population, $E[Y(1)-Y(0)]$. Reminder, coupon, or recommendation analyses often care about ATT because the business asks, “What was the effect on customers who actually received it?”
Unconfoundedness means potential outcomes are independent of treatment after conditioning on observed covariates: $(Y(1),Y(0)) \perp T \mid X$. This is untestable and is the central weakness of matching; missing drivers like purchase intent, seasonality, or prior reminder eligibility can invalidate the estimate.
Common support or overlap requires comparable treated and control units at similar propensity scores: $0 < P(T=1\mid X) < 1$. If high-propensity treated users have no comparable controls, trim them or report that the effect is only identified for the overlap population.
Covariate selection should include pre-treatment variables that affect both treatment and outcome, such as prior purchases, browse frequency, tenure, Prime status, device type, geography, prior notification engagement, and baseline spend. Do not control for post-treatment mediators like clicks caused by the reminder.
Matching methods include nearest-neighbor matching, caliper matching, exact matching on key strata, Mahalanobis distance within propensity calipers, and matching with or without replacement. A common rule is a caliper of $0.2$ standard deviations of the logit propensity score, though sensitivity checks matter more than blindly applying the rule.
Balance diagnostics are more important than predictive accuracy. Use standardized mean difference: $SMD = \frac{\bar X_T-\bar X_C}{\sqrt{(s_T^2+s_C^2)/2}}$ and aim for absolute SMD below roughly 0.1 after matching. Also inspect propensity score overlap plots and variance ratios.
Propensity model quality is not judged by AUC alone. A very high AUC can signal that treated and control groups are well separated on the covariates, which means overlap is poor and comparable matches are scarce. Logistic regression is interpretable; XGBoost or random forests can capture nonlinear assignment, but still require balance checks.
Effect estimation after matching usually compares matched outcomes: $\hat\tau_{ATT} = \frac{1}{N_T}\sum_{i:T_i=1}(Y_i-\sum_{j:T_j=0}w_{ij}Y_j)$, where the weights $w_{ij}$ sum to 1 over the controls matched to treated unit $i$. Use robust or bootstrap standard errors because matching induces dependence between observations; be aware that the naive bootstrap can be unreliable for nearest-neighbor matching with replacement.
Weights versus matches are related but not identical. Inverse probability weighting uses weights like $T/e(X)+(1-T)/(1-e(X))$ for ATE, while matching forms comparable pairs or sets. Weighting can be efficient but unstable when scores are near 0 or 1.
Time alignment is critical for DS causal work. Define covariates using only data before treatment exposure, define a clean treatment timestamp, and measure outcomes over a fixed post-period such as 7-day conversion, 14-day revenue, or weekly active users.
Scale considerations matter for implementation, though the interview is not a data engineering design exercise. For millions of users, estimate scores in Python with sklearn or statsmodels, then use approximate nearest-neighbor libraries or stratified matching; for tens of millions, coarsened strata or weighting may be more practical than full pairwise matching.
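
The matching and effect-estimation items above can be sketched in a few lines. This is a minimal illustration rather than a production matcher: it assumes propensity scores have already been estimated, uses greedy 1:1 nearest-neighbor matching without replacement on the logit score with a 0.2 SD caliper, and relies only on the standard library. The function names (`caliper_match`, `att_from_pairs`) and the toy data are hypothetical.

```python
import math
import statistics


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def caliper_match(scores, treated, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    scores  : estimated propensity scores, one per unit (assumed given)
    treated : 0/1 treatment flags in the same order as scores
    Returns a list of (treated_index, control_index) pairs.
    """
    logits = [logit(p) for p in scores]
    # Caliper width: a fraction of the SD of the logit propensity score.
    caliper = caliper_sd * statistics.pstdev(logits)
    available = {i for i, t in enumerate(treated) if t == 0}
    pairs = []
    for i, t in enumerate(treated):
        if t != 1 or not available:
            continue
        # Closest unused control on the logit scale.
        j = min(available, key=lambda k: abs(logits[k] - logits[i]))
        if abs(logits[j] - logits[i]) <= caliper:
            pairs.append((i, j))
            available.remove(j)
        # Otherwise the treated unit has no control within the caliper
        # and is dropped from the matched sample.
    return pairs


def att_from_pairs(pairs, outcomes):
    """Mean treated-minus-control outcome difference over matched pairs."""
    return statistics.mean(outcomes[i] - outcomes[j] for i, j in pairs)


# Toy data (made up): four treated users followed by four controls.
scores = [0.80, 0.60, 0.30, 0.95, 0.78, 0.58, 0.33, 0.10]
treated = [1, 1, 1, 1, 0, 0, 0, 0]
outcomes = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]

pairs = caliper_match(scores, treated)
att = att_from_pairs(pairs, outcomes)
```

In this toy data the treated user with score 0.95 finds no control inside the caliper and is dropped, so the estimate describes only the matched treated users. Greedy matching is order-dependent; optimal matching or matching with replacement are alternatives worth comparing in a sensitivity check.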

Worked example

For “Design causal study for reminder impact,” a strong candidate would start by clarifying: “What reminder are we studying, who was eligible, when was it sent, and what primary outcome matters: purchase_conversion, GMV, repeat_visit, or retention?” They would declare that randomization is preferred, but if the reminder was already deployed non-randomly, they would frame this as an observational causal inference problem.

The answer skeleton should have four pillars. First, define treatment as receiving the reminder during a specific window and define the outcome over a fixed post-treatment horizon. Second, construct pre-treatment covariates capturing baseline customer behavior: prior sessions, purchases, category interest, reminder history, Prime status, tenure, and seasonality indicators. Third, estimate propensity scores and perform matching or weighting while checking common support and covariate balance. Fourth, estimate ATT and run robustness checks such as placebo outcomes, alternative calipers, trimming, and segment-level heterogeneity.
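
The weighting alternative in the third pillar, combined with trimming, can be sketched as follows. This is a rough illustration under stated assumptions: propensity scores are taken as already estimated, the trimming bounds of 0.05 and 0.95 are an arbitrary illustrative choice, and the function names and toy data are hypothetical. For an ATT estimand, treated units get weight 1 and controls get weight $e(X)/(1-e(X))$.

```python
def att_weights(scores, treated, lo=0.05, hi=0.95):
    """ATT weights with trimming; None marks units outside [lo, hi]."""
    weights = []
    for e, t in zip(scores, treated):
        if e < lo or e > hi:
            weights.append(None)           # trimmed: outside common support
        elif t == 1:
            weights.append(1.0)            # treated units keep weight 1
        else:
            weights.append(e / (1.0 - e))  # controls reweighted toward treated
    return weights


def weighted_att(scores, treated, outcomes):
    """Weighted treated mean minus weighted control mean after trimming."""
    w = att_weights(scores, treated)
    kept = [(wi, t, y) for wi, t, y in zip(w, treated, outcomes) if wi is not None]

    def weighted_mean(group):
        rows = [(wi, y) for wi, t, y in kept if t == group]
        return sum(wi * y for wi, y in rows) / sum(wi for wi, _ in rows)

    return weighted_mean(1) - weighted_mean(0)


# Toy data (made up): three treated users followed by three controls.
scores = [0.80, 0.50, 0.98, 0.50, 0.20, 0.02]
treated = [1, 1, 1, 0, 0, 0]
outcomes = [10.0, 6.0, 12.0, 4.0, 2.0, 1.0]

weights = att_weights(scores, treated)
estimate = weighted_att(scores, treated, outcomes)
```

Rerunning the estimate with different trimming bounds is one cheap robustness check: if the answer moves a lot, the result depends heavily on units with extreme scores.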

One tradeoff to flag is whether to use propensity score matching alone or combine it with difference-in-differences. If treated users were already trending differently before the reminder, simple matching on static covariates may not remove bias; matching users on baseline features and then applying DiD on pre/post outcomes can be stronger if parallel trends look plausible.
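
Once matched pairs exist, the combined matching plus difference-in-differences estimate is a short calculation. A minimal sketch, assuming matched (treated, control) index pairs are already available and each user has one pre-period and one post-period outcome; the function name and all numbers are hypothetical.

```python
import statistics


def matched_did(pairs, pre, post):
    """DiD over matched pairs: treated change minus matched-control change."""
    diffs = [(post[i] - pre[i]) - (post[j] - pre[j]) for i, j in pairs]
    return statistics.mean(diffs)


# Toy data (made up): users 0 and 1 are treated, 2 and 3 are their matches.
pairs = [(0, 2), (1, 3)]
pre = [5.0, 3.0, 4.0, 2.0]    # outcome in the pre-treatment window
post = [8.0, 5.0, 5.0, 3.0]   # outcome in the post-treatment window

did = matched_did(pairs, pre, post)
# Post-only comparison on the same pairs, for contrast.
naive = statistics.mean(post[i] - post[j] for i, j in pairs)
```

Here the post-only comparison gives 2.5 while the DiD estimate is 1.5; the gap of 1.0 is the pre-period difference that matching on static covariates left behind. The DiD number is only credible if the matched groups would have moved in parallel without the reminder.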

A good close would be: “If I had more time, I would compare this observational estimate against any historical holdout or staggered rollout, because a randomized or quasi-random design would be more credible than relying entirely on selection-on-observables.”

A second angle

For “Explain Propensity Score Matching and Assess Covariate Balance,” the emphasis shifts from study design to technical diagnostics. You should define the score, explain why matching reduces observed covariate imbalance, and immediately state the assumptions: unconfoundedness, overlap, and correct time ordering. The interviewer may push on how you know matching worked; the right answer is not “the propensity model has high accuracy,” but “post-match SMDs are small, score distributions overlap, and key business covariates are balanced.” You can also mention that if balance remains poor, you would revise covariates, use exact matching on critical dimensions like marketplace or customer segment, change calipers, trim non-overlap, or move to a different design.
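
The balance check described here can be made concrete. A minimal sketch using only the standard library, with a hypothetical covariate (prior purchase counts) and made-up values; it computes the SMD with the pooled standard deviation, plus the variance ratio, before and after matching.

```python
import math
import statistics


def smd(x_treated, x_control):
    """Standardized mean difference using the pooled SD of both groups."""
    pooled_sd = math.sqrt(
        (statistics.variance(x_treated) + statistics.variance(x_control)) / 2
    )
    return (statistics.mean(x_treated) - statistics.mean(x_control)) / pooled_sd


def variance_ratio(x_treated, x_control):
    """Ratio of sample variances; values near 1 indicate similar spread."""
    return statistics.variance(x_treated) / statistics.variance(x_control)


# Hypothetical covariate: prior purchases per user (values are made up).
treated_x = [4.0, 6.0, 8.0, 10.0]
all_controls_x = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 9.5]
matched_controls_x = [4.0, 6.0, 8.0, 9.5]   # controls kept after matching

smd_before = smd(treated_x, all_controls_x)
smd_after = smd(treated_x, matched_controls_x)
vr_after = variance_ratio(treated_x, matched_controls_x)
```

In this toy data the absolute SMD drops from about 0.77 before matching to about 0.05 after, which would pass the rough 0.1 threshold. In practice the same check runs over every covariate, including ones not used in the propensity model.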

Common pitfalls

Pitfall: Treating PSM like a magic bias-removal tool.

The tempting answer is “we match treated and untreated users, then compare outcomes, so we have causality.” That misses the key limitation: matching only adjusts for observed pre-treatment confounders. A better answer explicitly says what unobserved factors could remain, such as customer intent, concurrent promotions, or recommendation ranking changes.

Pitfall: Optimizing the propensity model for prediction instead of balance.

A candidate may emphasize AUC, feature importance, or complex models because that sounds rigorous. For causal matching, predictive performance is secondary; the interviewer wants to hear balance diagnostics, common support, SMD thresholds, and sensitivity checks.

Pitfall: Communicating only formulas without business framing.

It is not enough to write $e(X)=P(T=1\mid X)$ and list assumptions. Tie the method to a real decision: whether reminders incrementally increase purchases, whether the estimate applies only to reachable users, and whether the expected lift justifies expanding the program.

Connections

Interviewers often pivot from propensity score matching to difference-in-differences, synthetic control, instrumental variables, uplift modeling, or double machine learning. They may also ask how your observational estimate compares with an A/B test, especially around selection bias, interference, novelty effects, and metric guardrails.
