Causal Inference for AI Products

What's being tested

Interviewers probe your ability to design, analyze, and defend causal claims about product changes using rigorous statistical thinking. They expect a Data Scientist to choose an identification strategy (randomized or quasi-experimental), specify metrics and guardrails, and lay out a pre-registered analysis plan with diagnostics for key assumptions. Anthropic cares because product decisions (model changes, ranking, personalization) require credible causal estimates to avoid shipping regressions or misattributing effects to confounding.

Core knowledge

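Average Treatment Effect (ATE) and Average Treatment Effect on the Treated (ATT) definitions: ATE = E[Y(1) − Y(0)], ATT = E[Y(1) − Y(0) | T=1]; randomization identifies both by design, while observational identification requires ignorability given covariates (with overlap) or an instrument, which identifies a local effect rather than the ATE.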
Randomization designs: completely randomized, stratified (blocked), cluster-randomized, and Bernoulli — tradeoffs: variance reduction vs implementation complexity and interference risk.
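Power & sample size formula (continuous outcome, two equal-sized arms, two-sided test):
$n_{\text{per arm}}=2\left(\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{\delta/\sigma}\right)^2$
where δ is the minimum detectable effect (absolute difference in means) and σ is the pooled SD; the factor of 2 comes from the variance of a difference between two arm means. Under cluster randomization, inflate n by the design effect 1 + (m−1)ρ for clusters of size m with intraclass correlation ρ.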
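As a rough illustration of the calculation, the sketch below computes the per-arm sample size for a two-sided test with two equal arms (including the factor of 2 for the variance of a difference in means) and then inflates it by a design effect. The effect size, SD, cluster size, and intraclass correlation are made-up inputs, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided, two-sample z-test with equal allocation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_beta = z.inv_cdf(power)            # quantile for the desired power (1 - beta)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


def design_effect(cluster_size, icc):
    """Variance inflation for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


# Hypothetical inputs: detect a 0.1 SD shift at alpha = 0.05 with 80% power.
n = n_per_arm(delta=0.1, sigma=1.0)
# Hypothetical clustering: 5 users per cluster, intraclass correlation 0.05.
n_clustered = math.ceil(n * design_effect(cluster_size=5, icc=0.05))

print(n)            # 1570 for these inputs
print(n_clustered)  # larger than n by the design effect of 1.2
```

For a binary outcome, the same function can be reused as an approximation by setting σ² to p(1 − p) at the baseline rate.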
Sequential testing and peeking: repeatedly checking a fixed-horizon test inflates Type I error; use alpha spending with group-sequential boundaries (e.g., O’Brien–Fleming) or sequentially adjusted (always-valid) p-values to control it.
Multiple comparisons: adjust with Holm–Bonferroni, Benjamini–Hochberg for FDR, or pre-specify single primary metric to avoid p-hacking.
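To make the two corrections concrete, here is a minimal pure-Python sketch of the Holm step-down and Benjamini–Hochberg step-up procedures. The p-values are invented for illustration; in practice a library routine would usually be preferable to hand-rolled code.

```python
def holm(pvals, alpha=0.05):
    """Holm step-down: controls the family-wise error rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):          # rank = 0 for the smallest p-value
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                             # stop at the first non-rejection
    return reject


def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: controls the false discovery rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k_max = rank                      # largest rank passing its threshold
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values for five secondary metrics.
pvals = [0.001, 0.012, 0.025, 0.045, 0.20]
print(holm(pvals))                # [True, True, False, False, False]
print(benjamini_hochberg(pvals))  # [True, True, True, False, False]
```

On this input BH rejects one more hypothesis than Holm, which reflects the usual tradeoff: FDR control is less conservative than family-wise error control.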
SUTVA & interference: the stable unit treatment value assumption (SUTVA) requires that one unit's outcome does not depend on other units' assignments; check whether that is plausible, or model partial interference (within clusters only). If interference is likely, prefer cluster randomization or exposure mappings.
Observational identification: propensity score methods (matching/weighting), inverse probability of treatment weighting (IPTW) with ATE weights $w_i=\frac{T_i}{e(X_i)}+\frac{1-T_i}{1-e(X_i)}$, doubly robust estimators, and diagnostics like standardized mean differences and overlap.

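Tip: trim units with extreme estimated propensities (e.g., outside 0.01–0.99) to reduce variance, and note that trimming changes the population the estimate applies to.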
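The following sketch shows one way the weighting and trimming could fit together: a self-normalized (Hajek) IPTW estimate of the ATE that drops units outside a propensity range. It assumes the propensities are already known; here they come from a simulated data-generating process with a single binary confounder and a true effect of 1.0, all of which is invented for illustration.

```python
import random


def iptw_ate(y, t, e, lo=0.01, hi=0.99):
    """Self-normalized IPTW estimate of the ATE.

    Units whose propensity falls outside [lo, hi] are dropped (trimmed).
    """
    kept = [(yi, ti, ei) for yi, ti, ei in zip(y, t, e) if lo <= ei <= hi]
    w1 = sum(ti / ei for _, ti, ei in kept)               # total treated weight
    w0 = sum((1 - ti) / (1 - ei) for _, ti, ei in kept)   # total control weight
    mu1 = sum(ti * yi / ei for yi, ti, ei in kept) / w1
    mu0 = sum((1 - ti) * yi / (1 - ei) for yi, ti, ei in kept) / w0
    return mu1 - mu0


# Simulated data (assumption): X confounds treatment and outcome; true ATE = 1.0.
random.seed(0)
n = 20_000
x = [random.random() < 0.5 for _ in range(n)]
e = [0.8 if xi else 0.2 for xi in x]                      # true propensity e(X)
t = [1 if random.random() < ei else 0 for ei in e]
y = [1.0 * ti + 2.0 * xi + random.gauss(0, 1) for ti, xi in zip(t, x)]

treated = [yi for yi, ti in zip(y, t) if ti == 1]
control = [yi for yi, ti in zip(y, t) if ti == 0]
naive = sum(treated) / len(treated) - sum(control) / len(control)
ate_hat = iptw_ate(y, t, e)

print(f"naive difference in means: {naive:.2f}")   # biased upward by confounding
print(f"IPTW estimate:             {ate_hat:.2f}") # should land near 1.0
```

With real logs the propensities would have to be estimated (for example by logistic regression), and the estimate is only as good as that model and the ignorability assumption behind it.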
Quasi-experimental methods: difference-in-differences (DiD) (requires parallel trends), event-study for dynamic effects, and synthetic control for single-unit rollouts; always show pre-treatment trend balance.
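Instrumental variables (IV): a valid instrument Z must affect treatment (relevance), be as good as randomly assigned (independence), and affect the outcome only through treatment (exclusion); with monotonicity, the estimate is a local average treatment effect (LATE) for compliers. Test instrument strength with the first-stage F-statistic (F > 10 is a common rule of thumb).
Heterogeneous treatment effects (HTE) and uplift: use causal forests with honest splitting (econml, grf) or two-model uplift, and report robustness across pre-specified subgroups rather than cherry-picked slices.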
Analysis plan & pre-registration: pre-specify primary metric, guardrails, inclusion/exclusion, missing-data handling, and estimand (ATE vs ATT) before unblinding.
Practical tooling: you’d typically run balance checks in pandas/R/statsmodels, estimate HTE with econml/causalml, and bootstrap CIs for nonparametric inference.
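As a lightweight stand-in for the pandas/statsmodels workflow, the sketch below computes a standardized mean difference for one covariate and a percentile bootstrap CI for a difference in means, using only the standard library. The covariate, outcomes, and the true effect of 0.5 are simulated assumptions.

```python
import random
from statistics import mean, variance


def standardized_mean_difference(x_treated, x_control):
    """SMD: difference in means divided by the pooled standard deviation."""
    pooled_sd = ((variance(x_treated) + variance(x_control)) / 2) ** 0.5
    return (mean(x_treated) - mean(x_control)) / pooled_sd


def bootstrap_ci(y_treated, y_control, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in means (arms resampled separately)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        bt = rng.choices(y_treated, k=len(y_treated))
        bc = rng.choices(y_control, k=len(y_control))
        diffs.append(sum(bt) / len(bt) - sum(bc) / len(bc))
    diffs.sort()
    return diffs[int(alpha / 2 * n_boot)], diffs[int((1 - alpha / 2) * n_boot) - 1]


# Simulated data (assumption): a balanced pre-treatment covariate, true effect 0.5.
rng = random.Random(1)
n = 2000
age_t = [rng.gauss(30, 8) for _ in range(n)]
age_c = [rng.gauss(30, 8) for _ in range(n)]
y_t = [rng.gauss(0.5, 1) for _ in range(n)]
y_c = [rng.gauss(0.0, 1) for _ in range(n)]

smd = standardized_mean_difference(age_t, age_c)
lo, hi = bootstrap_ci(y_t, y_c)
print(f"SMD for age: {smd:.3f}")            # |SMD| < 0.1 is a common balance threshold
print(f"95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```

If assignment is clustered, the resampling unit should be the cluster rather than the individual row.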

Worked example — measuring a new model’s effect on 28-day retention

Frame: ask clarifying questions first: what is the unit of randomization (user, session, device), what is the primary metric (28-day retained, a binary outcome), what effect size is expected, what are the rollout constraints, and is interference likely (e.g., multiple accounts per household)? Skeleton answer pillars: 1) Design: choose individual-level randomization if interference is low; otherwise cluster by household/geography. 2) Metrics & guardrails: pre-register the primary metric (28-day retention), key guardrails (engagement drops, errors), and secondary metrics (CTR, NPS). 3) Analysis plan: specify the estimand (ATE), use logistic/linear regression with pre-treatment covariates for precision, compute adjusted and unadjusted ATE, and report bootstrap CIs. 4) Diagnostics: check covariate balance, follow an alpha-spending plan for early looks, and check for heterogeneous effects. Tradeoff to flag: cluster randomization reduces contamination but increases the required sample size (design effect = 1 + (m−1)ρ, where m is the cluster size and ρ the intraclass correlation). Close by saying: with more time, you would add subgroup HTE via causal forest, apply pre-registered multiple-comparison corrections to the secondary metrics, and run a parallel DiD on historical cohorts to triangulate long-term retention.
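For the unadjusted ATE on a binary retention outcome, a minimal sketch is a difference in proportions with a Wald confidence interval, optionally widened by a design effect when assignment is clustered. The counts, the household size of 2, and ρ = 0.3 below are hypothetical, and multiplying the variance by the design effect is an approximation that assumes roughly equal cluster sizes.

```python
from statistics import NormalDist


def retention_ate(retained_t, n_t, retained_c, n_c, deff=1.0, alpha=0.05):
    """Difference in retention rates with a Wald CI and two-sided p-value.

    deff multiplies the variance to approximate cluster-randomized designs.
    """
    p_t, p_c = retained_t / n_t, retained_c / n_c
    diff = p_t - p_c
    var = p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c
    se = (deff * var) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


# Hypothetical counts: 42.0% vs 40.0% 28-day retention, 10,000 users per arm.
diff, ci, p_value = retention_ate(4200, 10_000, 4000, 10_000)
print(f"ATE = {diff:.3f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f}), p = {p_value:.4f}")

# Same counts under household randomization (assumed m = 2, rho = 0.3).
household_deff = 1 + (2 - 1) * 0.3
diff_h, ci_h, p_value_h = retention_ate(4200, 10_000, 4000, 10_000, deff=household_deff)
print(f"clustered 95% CI = ({ci_h[0]:.4f}, {ci_h[1]:.4f})")  # wider than the first CI
```

The covariate-adjusted estimate from the analysis plan would come from a regression instead; this unadjusted version is the baseline it gets compared against.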

A second angle — observational rollout with non-random exposure

Now suppose the new model rolled out to a non-random subset of users and you must estimate the causal effect from logs. Use a quasi-experimental toolkit: 1) inspect pre-rollout trends and covariate balance; if parallel trends are plausible, implement DiD or an event-study; 2) if user selection is based on observable features, apply propensity-score weighting plus doubly robust estimation; 3) if there's an instrument (e.g., rollout timing assigned by server shard), consider IV and interpret the estimate as a LATE. Emphasize diagnostics: pre-trend plots, SMD < 0.1, common support (overlap), and sensitivity analyses (Rosenbaum bounds) to quantify how much unobserved confounding would be needed to overturn the result.
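The DiD step can be sketched on aggregated data as follows. The weekly retention rates are invented so that the pre-period gap between groups is constant (parallel trends hold by construction); real logs would need a regression with clustered standard errors rather than this purely descriptive calculation.

```python
from statistics import mean

# Hypothetical weekly retention rates; weeks 0-2 are pre-rollout, weeks 3-5 post-rollout.
treated = [0.40, 0.41, 0.42, 0.46, 0.47, 0.48]
control = [0.35, 0.36, 0.37, 0.38, 0.39, 0.40]
first_post = 3


def did(treated, control, first_post):
    """2x2 difference-in-differences on pre/post group means."""
    d_treated = mean(treated[first_post:]) - mean(treated[:first_post])
    d_control = mean(control[first_post:]) - mean(control[:first_post])
    return d_treated - d_control


def event_study_gaps(treated, control, first_post):
    """Treated-minus-control gap per period, relative to the last pre-period."""
    gaps = [t - c for t, c in zip(treated, control)]
    ref = gaps[first_post - 1]
    return [g - ref for g in gaps]


estimate = did(treated, control, first_post)
gaps = event_study_gaps(treated, control, first_post)
print(f"DiD estimate: {estimate:.3f}")
print("event-study gaps:", [round(g, 3) for g in gaps])
```

Pre-period gaps near zero are what a pre-trend plot would show when parallel trends are plausible; gaps that drift before the rollout are a warning sign for the DiD estimate.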

Common pitfalls

Pitfall: Conditioning on post-treatment variables (mediators or colliders) biases the ATE estimate. Analysts often adjust for downstream behavior such as post-launch engagement; instead, adjust only for covariates measured before treatment and never for variables the treatment can affect.

Pitfall: Overstating causality without stating assumptions. Saying “the model caused retention to increase” is weaker than “under ignorability given X, estimated ATE=…”; explicitly name the identifying assumption and diagnostics.

Pitfall: Ignoring clustering or interference. A tempting but wrong shortcut is to treat clustered assignment as independent observations; this underestimates variance and produces overconfident p-values.
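A small simulation can make the size of this error visible. The sketch below assumes 40 clusters of 50 users per arm with a shared cluster-level shock (intraclass correlation about 0.5), then compares the naive standard error that treats rows as independent with one computed from cluster means. All parameters are made up.

```python
import random
from statistics import mean, variance

rng = random.Random(42)


def simulate_arm(n_clusters, cluster_size, effect):
    """Return a list of clusters; units in a cluster share one random shock."""
    clusters = []
    for _ in range(n_clusters):
        shared = rng.gauss(0, 1)   # cluster-level shock, which induces the ICC
        clusters.append([effect + shared + rng.gauss(0, 1) for _ in range(cluster_size)])
    return clusters


def naive_se(arm_a, arm_b):
    """SE of the mean difference, wrongly treating every row as independent."""
    flat_a = [y for c in arm_a for y in c]
    flat_b = [y for c in arm_b for y in c]
    return (variance(flat_a) / len(flat_a) + variance(flat_b) / len(flat_b)) ** 0.5


def cluster_se(arm_a, arm_b):
    """SE of the mean difference, using one mean per cluster as the observation."""
    means_a = [mean(c) for c in arm_a]
    means_b = [mean(c) for c in arm_b]
    return (variance(means_a) / len(means_a) + variance(means_b) / len(means_b)) ** 0.5


treated = simulate_arm(40, 50, effect=0.2)
control = simulate_arm(40, 50, effect=0.0)
se_naive = naive_se(treated, control)
se_cluster = cluster_se(treated, control)
print(f"naive SE:         {se_naive:.3f}")
print(f"cluster-level SE: {se_cluster:.3f}")  # several times larger here
```

With these settings the design effect is 1 + (50 − 1) × 0.5 = 25.5, so the honest standard error is roughly five times the naive one.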

Connections

This area connects naturally to A/B testing platforms and experimentation pipelines, to heterogeneous treatment effect estimation and policy learning, and to product-risk analyses like safety/regression detection. Interviewers may pivot to uplift modeling, deployment risk, or time-to-event (survival) analyses.