Propensity Score Matching, DiD And Causal Inference

What's being tested

Interviewers are probing whether you can estimate causal impact when randomization is unavailable or imperfect, not just whether you can run a regression. For a TikTok Data Scientist, this matters because many product, ads, commerce, and notification decisions are made from logged observational data where exposed users differ systematically from unexposed users. You need to define the treatment, outcome, unit of analysis, and time window, then choose between methods like propensity score matching, inverse probability weighting, difference-in-differences, or doubly robust estimation. Strong answers show assumptions, diagnostics, sensitivity checks, and how the estimate would guide a product decision.

Core knowledge

Causal estimands must be explicit: ATE estimates $E[Y(1)-Y(0)]$ for the whole population, while ATT estimates $E[Y(1)-Y(0)\mid T=1]$ for treated users. For TikTok product launches, ATT is often natural if asking “what happened to users who received the feature?”
Potential outcomes frame the missing-data problem: each user has $Y_i(1)$ and $Y_i(0)$, but only one is observed. The causal effect is not simply $E[Y\mid T=1]-E[Y\mid T=0]$ unless treatment is independent of potential outcomes.
Confounding occurs when pre-treatment variables affect both treatment and outcome. For example, heavy TikTok users may be more likely to receive notifications and also more likely to retain, inflating the apparent impact of notification exposure on D7_retention.
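A small simulation can make the confounding story concrete. The sketch below is illustrative only: the 40% heavy-user share, the exposure probabilities, the baseline retention rates, and the 2-point true effect are all made-up assumptions, not real product figures. It compares the naive treated-minus-control difference with a difference computed within heavy and light users and then averaged.

```python
import random

def simulate(n=50000, true_effect=0.02, seed=0):
    """Simulate users where heavy usage drives both exposure and retention."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        heavy = rng.random() < 0.4                       # pre-treatment confounder
        treated = rng.random() < (0.7 if heavy else 0.2)  # heavy users get more notifications
        base = 0.50 if heavy else 0.20                    # retention without treatment
        retained = rng.random() < base + (true_effect if treated else 0.0)
        rows.append((heavy, treated, retained))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def naive_diff(rows):
    """Raw difference in retention between treated and untreated users."""
    return mean(r for _, t, r in rows if t) - mean(r for _, t, r in rows if not t)

def stratified_diff(rows):
    """Within-stratum differences, averaged with stratum-size weights (an ATE)."""
    total = 0.0
    for heavy in (True, False):
        stratum = [row for row in rows if row[0] == heavy]
        total += naive_diff(stratum) * len(stratum) / len(rows)
    return total

rows = simulate()
print(f"naive: {naive_diff(rows):.3f}, stratified: {stratified_diff(rows):.3f}")
```

Under these assumed parameters the naive difference is several times larger than the simulated effect, while the stratified estimate lands close to it, because conditioning on the single confounder removes the selection into treatment.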
Propensity score is $e(X)=P(T=1\mid X)$, usually estimated with logistic regression, XGBoost, or calibrated tree models. It collapses many covariates into a single balancing score, but it only adjusts for observed pre-treatment covariates, not hidden motivation or intent.
Matching pairs treated and control units with similar propensity scores or covariates. Common choices are nearest-neighbor matching, caliper matching, exact matching on key segments, and matching with replacement. For millions of users, exact pairwise matching can be expensive; use blocking, calipers, or approximate nearest neighbors.
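One way to picture caliper matching is the greedy sketch below. It assumes propensity scores have already been estimated and are passed in as plain dictionaries; the function name `caliper_match`, the user ids, and the 0.05 caliper are hypothetical choices for illustration. It matches 1:1 without replacement and scans all remaining controls for each treated user, so it is only suitable for small samples.

```python
def caliper_match(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on propensity score, without replacement.

    treated, controls: dicts mapping user id -> estimated propensity score.
    Returns a list of (treated_id, control_id) pairs that fall within the caliper.
    """
    available = dict(controls)
    pairs = []
    # Process treated users from highest score down, since they are
    # usually the hardest to match.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # without replacement
    return pairs

treated = {"t1": 0.80, "t2": 0.55, "t3": 0.95}
controls = {"c1": 0.78, "c2": 0.52, "c3": 0.30}
pairs = caliper_match(treated, controls)
print(pairs)
```

Here `t3` has no control within the caliper and is dropped, which changes the population the estimate describes. At larger scale the linear scan would be replaced by blocking or an approximate nearest-neighbor index.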
Balance diagnostics matter more than propensity model AUC. Check standardized mean differences before and after matching:
$SMD = \frac{\bar X_T-\bar X_C}{\sqrt{(s_T^2+s_C^2)/2}}$
A common target is absolute SMD < 0.1 for important covariates.
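The SMD formula above translates directly into a few lines of code. This is a minimal sketch using only the standard library; the helper name `flag_imbalance` and the toy covariate values are assumptions for illustration.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance

def smd(treated_values, control_values):
    """Standardized mean difference using the average of the two sample variances."""
    diff = mean(treated_values) - mean(control_values)
    pooled_sd = sqrt((variance(treated_values) + variance(control_values)) / 2)
    if pooled_sd == 0:
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / pooled_sd

def flag_imbalance(treated_covs, control_covs, threshold=0.1):
    """Return {covariate: SMD} for covariates whose |SMD| is at or above threshold."""
    flagged = {}
    for name in treated_covs:
        value = smd(treated_covs[name], control_covs[name])
        if abs(value) >= threshold:
            flagged[name] = value
    return flagged

# Toy data: tenure differs between groups, sessions does not.
treated_covs = {"tenure": [1, 2, 3, 4, 5], "sessions": [3, 4, 5, 6, 7]}
control_covs = {"tenure": [2, 3, 4, 5, 6], "sessions": [3, 4, 5, 6, 7]}
print(flag_imbalance(treated_covs, control_covs))
```

Running the same function on the sample before and after matching gives the before/after balance table that interviewers usually expect.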
Overlap, also called positivity, requires comparable treated and control users across the covariate space: $0 < P(T=1\mid X) < 1$. If high-value creators always receive a feature and low-activity users never do, causal effects are only credible within the overlapping segment.
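A simple way to restrict analysis to the overlapping segment is to keep only units whose propensity score falls in the range covered by both groups. The sketch below assumes scores are already estimated; the function name `common_support` and the example scores are hypothetical.

```python
def common_support(scores, treated):
    """Return (lo, hi, kept_indices) for the min-max region of common support.

    scores:  list of estimated propensity scores.
    treated: list of 0/1 treatment flags, aligned with scores.
    """
    t_scores = [s for s, t in zip(scores, treated) if t]
    c_scores = [s for s, t in zip(scores, treated) if not t]
    lo = max(min(t_scores), min(c_scores))  # lowest score seen in both groups
    hi = min(max(t_scores), max(c_scores))  # highest score seen in both groups
    kept = [i for i, s in enumerate(scores) if lo <= s <= hi]
    return lo, hi, kept

scores = [0.02, 0.10, 0.40, 0.60, 0.90, 0.99]
treated = [0, 0, 1, 0, 1, 1]
lo, hi, kept = common_support(scores, treated)
print(lo, hi, kept)
```

Min-max bounds are sensitive to a single extreme score, so fixed thresholds on the propensity are another option. Either way, the share of users dropped should be reported, because the estimate then applies only to the retained segment.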
Difference-in-differences estimates treatment impact by comparing outcome changes over time:
$\hat\tau = (\bar Y_{T,post}-\bar Y_{T,pre})-(\bar Y_{C,post}-\bar Y_{C,pre})$
It is useful when treated and control groups differ in levels but would have followed parallel trends absent treatment.
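The 2x2 formula above can be computed from four group means. The sketch below assumes long-format rows of (treated, post, outcome); the function name `did_estimate` and the toy outcome values are made up for illustration.

```python
def did_estimate(rows):
    """Difference-in-differences from rows of (treated: bool, post: bool, outcome)."""
    sums = {}
    for treated, post, y in rows:
        cell = sums.setdefault((treated, post), [0.0, 0])
        cell[0] += y
        cell[1] += 1
    means = {key: total / count for key, (total, count) in sums.items()}
    treated_change = means[(True, True)] - means[(True, False)]
    control_change = means[(False, True)] - means[(False, False)]
    return treated_change - control_change

rows = [
    (True, False, 29.0), (True, False, 31.0),   # treated, pre:  mean 30
    (True, True, 35.0), (True, True, 37.0),     # treated, post: mean 36
    (False, False, 19.0), (False, False, 21.0), # control, pre:  mean 20
    (False, True, 22.0), (False, True, 24.0),   # control, post: mean 23
]
print(did_estimate(rows))
```

In this toy data the treated group starts 10 units higher than control, but that level difference cancels out: the estimate is the treated change (6) minus the control change (3).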
Parallel trends should be tested visually and quantitatively using multiple pre-periods. If treated users’ watch_time or repurchase_rate was already accelerating before exposure, a simple DiD estimate may attribute an existing trend to the intervention.
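A descriptive first pass at the pre-trend check is to track how the treated-minus-control gap moves across pre-periods. The sketch below assumes per-period group means are already computed; the function name `pre_trend_gap_changes` and the weekly numbers are hypothetical. It is not a formal test, since it ignores sampling noise.

```python
def pre_trend_gap_changes(treated_means, control_means):
    """Period-over-period change in the treated-minus-control gap.

    Both inputs are lists of group means for consecutive pre-treatment periods.
    Values near zero are consistent with parallel trends; a steady drift is not.
    """
    gaps = [t - c for t, c in zip(treated_means, control_means)]
    return [later - earlier for earlier, later in zip(gaps, gaps[1:])]

# Hypothetical weekly watch_time means over four pre-treatment weeks.
treated_means = [30.0, 31.0, 32.5, 34.5]
control_means = [20.0, 21.0, 22.0, 23.0]
changes = pre_trend_gap_changes(treated_means, control_means)
print(changes)
```

Each entry is effectively a placebo DiD between two adjacent pre-periods. In this made-up series the gap widens in the later weeks, which is the accelerating pattern that would make a simple DiD estimate suspect.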
Regression adjustment can complement matching or DiD. A typical DiD regression is $Y_{it}=\alpha_i+\gamma_t+\tau(T_i \times Post_t)+\epsilon_{it}$ with user and time fixed effects. Cluster standard errors at the level where treatment is assigned or outcomes are correlated, such as user, creator, market, or campaign.
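To see how the fixed-effects regression relates to the 2x2 formula, consider the special case of a balanced panel with exactly two periods (an assumption made here only for illustration). Differencing each user's post and pre outcomes removes the user fixed effect $\alpha_i$:
$\Delta Y_i = Y_{i,post}-Y_{i,pre} = (\gamma_{post}-\gamma_{pre}) + \tau T_i + (\epsilon_{i,post}-\epsilon_{i,pre})$
Regressing $\Delta Y_i$ on $T_i$ with an intercept then gives the difference in mean changes:
$\hat\tau = \overline{\Delta Y}_T - \overline{\Delta Y}_C = (\bar Y_{T,post}-\bar Y_{T,pre})-(\bar Y_{C,post}-\bar Y_{C,pre})$
So in this special case the regression coefficient and the difference-in-differences of group means coincide. The regression form earns its keep with multiple periods, covariates, or staggered exposure.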
Doubly robust estimation combines propensity weighting with outcome modeling. Methods like AIPW remain consistent if either the propensity model or outcome model is correctly specified, making them attractive for observational ad lift or notification impact studies.
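The double-robustness property can be checked on simulated data. In the sketch below the data-generating process, the true ATE of 0.5, and the deliberately wrong nuisance models are all assumptions chosen for illustration. In a real study both models would be estimated from data, ideally with cross-fitting, rather than plugged in as known functions.

```python
import random

def aipw_ate(rows, e_hat, mu1_hat, mu0_hat):
    """AIPW estimate of the ATE.

    rows: list of (x, t, y) with binary treatment t.
    e_hat, mu1_hat, mu0_hat: callables giving the propensity and the expected
    outcome under treatment and under control for covariate x.
    """
    total = 0.0
    for x, t, y in rows:
        e, m1, m0 = e_hat(x), mu1_hat(x), mu0_hat(x)
        # Outcome-model contrast plus inverse-propensity-weighted residuals.
        total += (m1 - m0) + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e)
    return total / len(rows)

def simulate(n=50000, seed=1):
    """Confounded data with a constant treatment effect of 0.5."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.4 else 0
        t = 1 if rng.random() < (0.7 if x else 0.2) else 0
        y = 1.0 + 2.0 * x + 0.5 * t + rng.gauss(0, 1)
        rows.append((x, t, y))
    return rows

rows = simulate()
true_e = lambda x: 0.7 if x else 0.2
true_mu1 = lambda x: 1.5 + 2.0 * x
true_mu0 = lambda x: 1.0 + 2.0 * x
wrong_e = lambda x: 0.5    # ignores the confounder
wrong_mu = lambda x: 0.0   # ignores everything

good_propensity_only = aipw_ate(rows, true_e, wrong_mu, wrong_mu)
good_outcome_only = aipw_ate(rows, wrong_e, true_mu1, true_mu0)
print(f"{good_propensity_only:.3f} {good_outcome_only:.3f}")
```

Both estimates should land near the simulated effect even though one nuisance model is wrong in each case. With the outcome model set to zero, the estimator reduces to plain inverse probability weighting, which is why that variant is noisier.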
Heterogeneous treatment effects ask who benefits, not just whether the average effect is positive. Use causal forests, meta-learners (T-, S-, and X-learners), or uplift models, but validate with honest sample splitting and avoid turning noisy subgroup effects into targeting policy.

Worked example

For the prompt “Design causal measurement without randomization,” a strong candidate would first clarify: “What exactly is the treatment: eligibility for the notification feature, actual notification sent, or notification opened?” They would also ask for the unit, likely user-level; the outcome, such as D7_retention; and the exposure and observation windows, such as feature exposure on day 0 and retention over days 1–7.

The answer skeleton should have four pillars. First, define a causal estimand, probably ATT: the impact on users who actually became eligible or received the notification. Second, draw the likely confounders: prior app opens, watch time, session frequency, country, device, tenure, notification settings, creator follows, historical retention, and prior push engagement. Third, propose a method: start with propensity score matching or inverse probability weighting using only pre-treatment covariates, then consider DiD if there are repeated pre/post outcomes. Fourth, discuss diagnostics: propensity overlap, covariate balance, placebo tests on pre-period outcomes, and sensitivity to calipers or model choice.
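Since the skeleton targets the ATT, the weighting step can be sketched as follows: treated users keep weight 1 and each control is weighted by $e/(1-e)$, so the reweighted controls resemble the treated population. The function name `att_ipw` and the toy rows are hypothetical, and the propensity scores are assumed to be already estimated from pre-treatment covariates.

```python
def att_ipw(rows):
    """ATT via inverse probability weighting.

    rows: iterable of (propensity, treated, outcome).
    Treated units get weight 1; controls get weight e / (1 - e).
    """
    treated_sum = treated_n = 0.0
    control_sum = control_weight = 0.0
    for e, t, y in rows:
        if t:
            treated_sum += y
            treated_n += 1
        else:
            w = e / (1 - e)
            control_sum += w * y
            control_weight += w
    return treated_sum / treated_n - control_sum / control_weight

# Toy data: a high-propensity segment (e=0.8) and a low-propensity one (e=0.2),
# each with a within-segment retention gap of 0.10.
rows = (
    [(0.8, 1, 0.6)] * 4 + [(0.8, 0, 0.5)] * 1 +
    [(0.2, 1, 0.3)] * 1 + [(0.2, 0, 0.2)] * 4
)
print(round(att_ipw(rows), 3))
```

In this toy data the unweighted treated-minus-control gap is 0.28, because treated users come mostly from the high-retention segment. The weighted comparison recovers the within-segment gap of 0.10.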

A specific tradeoff to flag is treatment definition. Estimating the effect of “notification opened” is more biased than estimating the effect of “notification sent” because openers are intrinsically more engaged; if logs support it, eligibility or send assignment is closer to an intervention. The candidate should close by saying that if more time were available, they would run robustness checks with doubly robust estimation, segment by new versus mature users, and compare the observational estimate to any future randomized holdout.

A second angle

For the prompt “Analyze Negative Reviews' Impact on Coupon Repurchase Rate,” the treatment is not a product exposure but an experience: seeing negative reviews before deciding whether to repurchase a coupon. The same causal machinery applies, but the confounding story changes: users who encounter negative reviews may be shopping different merchants, categories, price points, or deal types. A strong answer would define the unit as user-coupon or user-merchant, use pre-review covariates like historical purchase frequency, merchant rating, coupon discount, category, geo, and prior refund behavior, then estimate impact on coupon_repurchase_rate. DiD may be especially useful if the same merchants or cohorts can be observed before and after review shocks, but the candidate must check whether treated and comparison merchants had parallel repurchase trends before the negative reviews appeared.

Common pitfalls

Pitfall: Treating observational comparison as an A/B test.

A weak answer says, “Compare retention for users who got the feature versus users who did not.” That ignores selection bias. A better answer states the causal assumption explicitly: after conditioning on pre-treatment covariates, treatment assignment is as-good-as-random, then validates that assumption through balance, overlap, placebo outcomes, and sensitivity checks.

Pitfall: Optimizing the propensity model for prediction instead of balance.

A high AUC propensity model may separate treated and control users so well that there is little overlap, making matching unreliable. The goal is not to predict treatment perfectly; it is to create comparable groups. Report standardized mean differences, propensity score distributions, matched sample size, and which population the final estimate applies to.

Pitfall: Communicating only the method, not the decision implication.

Interviewers do not want a lecture that ends with “I would use PSM.” They expect an interpretation like: “If the ATT is +1.2 percentage points in D7_retention, the confidence interval excludes zero, balance is acceptable, and placebo tests pass, I would recommend rollout for overlapping user segments, but not extrapolate to users outside the region of common support.”

Connections

Interviewers may pivot from here to A/B testing, sample ratio mismatch, metric design, uplift modeling, or off-policy evaluation for ranking and notification policies. They may also ask how you would combine causal estimates with product constraints such as user fatigue, long-term retention, creator ecosystem effects, or fairness across markets.
