Difference-In-Differences

What's being tested

Interviewers are probing whether you can estimate a causal treatment effect when Amazon cannot or did not run a clean randomized experiment. The shared skill is translating a product or policy rollout—reminders, concessions, gift cards, seller programs, customer messaging—into a credible panel-data causal design with explicit identification assumptions. You need to explain not only the estimator, but also why it would or would not recover the effect Amazon cares about: incremental purchases, repeat engagement, defect reduction, concession cost, NPS, or downstream retention. A strong Data Scientist answer shows judgment around treatment definition, comparison group construction, pre-trend diagnostics, heterogeneous effects, and uncertainty.

Core knowledge

Difference-in-differences compares outcome changes for treated units versus control units:
$\hat{\tau}_{DiD}=(\bar{Y}_{T,post}-\bar{Y}_{T,pre})-(\bar{Y}_{C,post}-\bar{Y}_{C,pre})$
The target is usually an average treatment effect on the treated, not necessarily a population-wide launch effect.
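As a minimal sketch of the 2x2 formula above, the snippet below computes the four cell means on made-up toy data. The `rows` values and the `cell_mean` and `did_estimate` helpers are illustrative assumptions, not real data or an established API.

```python
from statistics import mean

# Hypothetical toy data: (group, period, outcome), e.g. weekly orders per customer.
rows = [
    ("treated", "pre", 10.0), ("treated", "pre", 12.0),
    ("treated", "post", 15.0), ("treated", "post", 17.0),
    ("control", "pre", 9.0), ("control", "pre", 11.0),
    ("control", "post", 11.0), ("control", "post", 13.0),
]


def cell_mean(group, period):
    """Average outcome for one group-period cell."""
    return mean(y for g, p, y in rows if g == group and p == period)


def did_estimate():
    """(treated post - treated pre) - (control post - control pre)."""
    treated_change = cell_mean("treated", "post") - cell_mean("treated", "pre")
    control_change = cell_mean("control", "post") - cell_mean("control", "pre")
    return treated_change - control_change


tau_hat = did_estimate()
print(tau_hat)
```

Here the treated group rises by 5 and the control group by 2, so the estimate attributes the remaining 3 units to treatment, which is only valid if parallel trends holds.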
Parallel trends is the central identifying assumption: absent treatment, treated and control units would have experienced the same average outcome trend. It is not testable directly, but you can assess credibility using multiple pre-periods, placebo interventions, business context, and covariate balance.
Unit of analysis must match the intervention and decision. A reminder feature may be at customer_id-week, a seller policy at seller_id-month, and a concession gift-card policy at marketplace-week or order_id cohort. Wrong granularity can create spillovers, double counting, or invalid standard errors.
Panel construction usually means one row per unit-time, including treated and untreated units, treatment timing, outcomes, and pre-treatment covariates. In SQL, this often involves aggregating events to user-week metrics, joining rollout dates, and filling zero-activity periods so attrition does not masquerade as impact.
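The panel-building step can be sketched in plain Python, as a stand-in for the equivalent SQL aggregation and date-spine join. The event log, the `rollout_week` mapping, and the `build_panel` helper are hypothetical; the point is that unit-weeks with no activity appear as explicit zeros rather than missing rows.

```python
from collections import defaultdict
from itertools import product

# Hypothetical event log: (customer_id, week, orders).
events = [("c1", 1, 2), ("c1", 3, 1), ("c2", 2, 4), ("c1", 1, 1)]
rollout_week = {"c1": 3}  # c2 is never treated in this toy example
weeks = range(1, 5)


def build_panel(events, rollout_week, weeks):
    """Return one row per customer-week, zero-filling inactive weeks."""
    totals = defaultdict(int)
    for cid, week, orders in events:
        totals[(cid, week)] += orders  # aggregate events to unit-week
    units = sorted({cid for cid, _, _ in events} | set(rollout_week))
    panel = []
    for cid, week in product(units, weeks):
        start = rollout_week.get(cid)
        panel.append({
            "customer_id": cid,
            "week": week,
            "orders": totals.get((cid, week), 0),  # 0, not missing
            "treated": int(start is not None and week >= start),
        })
    return panel


panel = build_panel(events, rollout_week, weeks)
```

Building the full unit-by-week grid first and then attaching outcomes is what keeps inactive customers in the denominator.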
Two-way fixed effects estimates
$Y_{it}=\alpha_i+\lambda_t+\beta D_{it}+\epsilon_{it}$
where $\alpha_i$ absorbs stable unit differences and $\lambda_t$ absorbs common shocks. It is intuitive, but can be biased under staggered adoption with heterogeneous treatment effects.
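To make the fixed-effects mechanics concrete, here is a sketch that recovers $\beta$ by double-demeaning, which is equivalent to including unit and time dummies only when the panel is balanced. The `twfe_beta` helper and the noise-free toy data are assumptions for illustration; in practice a regression library with clustered standard errors would be used.

```python
def twfe_beta(y, d):
    """Within estimator for a balanced panel.

    y, d: dicts keyed by (unit, period) holding outcome and treatment dummy.
    """
    units = sorted({i for i, _ in y})
    periods = sorted({t for _, t in y})

    def demean(x):
        grand = sum(x.values()) / len(x)
        unit_mean = {i: sum(x[i, t] for t in periods) / len(periods) for i in units}
        time_mean = {t: sum(x[i, t] for i in units) / len(units) for t in periods}
        return {(i, t): x[i, t] - unit_mean[i] - time_mean[t] + grand for i, t in x}

    y_w, d_w = demean(y), demean(d)
    return sum(d_w[k] * y_w[k] for k in y) / sum(v * v for v in d_w.values())


# Toy data: unit effects, common time shocks, and a constant effect of 2.0.
alpha = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 5.0}
lam = {0: 0.0, 1: 0.5, 2: 1.5, 3: 1.0}
treated_units, start = {"a", "b"}, 2
d = {(i, t): float(i in treated_units and t >= start) for i in alpha for t in lam}
y = {(i, t): alpha[i] + lam[t] + 2.0 * d[i, t] for i in alpha for t in lam}

beta_hat = twfe_beta(y, d)
```

Because the toy data has a single adoption date and a homogeneous effect, the estimator recovers the built-in effect; that guarantee does not carry over to staggered adoption with heterogeneous effects.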
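Staggered rollout needs extra care because already-treated units can become implicit controls for newly treated units. When adoption timing varies and effects evolve over time, prefer heterogeneity-robust estimators such as Callaway-Sant’Anna group-time effects, the Sun-Abraham interaction-weighted event study, or stacked DiD, all of which avoid using already-treated units as controls. A plain two-way fixed effects event study does not solve this problem on its own.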
Event-study diagnostics estimate leads and lags around treatment:
$Y_{it}=\alpha_i+\lambda_t+\sum_{k\neq -1}\beta_k 1[t-T_i=k]+\epsilon_{it}$
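Here $T_i$ is unit $i$'s treatment date and $k=-1$ is the omitted reference period; for never-treated units, all event-time indicators are set to zero. Pre-treatment lead coefficients ($k<-1$) should be near zero; post-treatment lags ($k\ge 0$) reveal ramp-up, decay, or delayed behavior.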
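With a single common adoption date and a balanced panel, the lead and lag coefficients should coincide with the treated-minus-control gap in each period, measured relative to the gap at $k=-1$. The sketch below computes that descriptive version; `event_study_gaps` and the toy data are hypothetical, and a staggered rollout would need one of the regression-based or group-time estimators instead.

```python
from statistics import mean


def event_study_gaps(panel, treated_units, adoption_period, ref=-1):
    """Return {event_time k: gap_k - gap_ref} for a common adoption date.

    panel: dict keyed by (unit, period) holding the outcome.
    """
    periods = sorted({t for _, t in panel})

    def gap(t):
        treated = mean(v for (i, s), v in panel.items()
                       if s == t and i in treated_units)
        control = mean(v for (i, s), v in panel.items()
                       if s == t and i not in treated_units)
        return treated - control

    base = gap(adoption_period + ref)  # reference period, k = -1 by default
    return {t - adoption_period: gap(t) - base for t in periods}


# Toy data: flat pre-period gap, then an effect that ramps up after adoption.
alpha = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 5.0}
lam = {0: 0.0, 1: 0.5, 2: 1.5, 3: 1.0, 4: 2.0}
treated_units, adoption = {"a", "b"}, 2
ramp = {0: 1.0, 1: 2.0, 2: 3.0}  # assumed effect by event time
panel = {
    (i, t): alpha[i] + lam[t] + (ramp.get(t - adoption, 0.0) if i in treated_units else 0.0)
    for i in alpha for t in lam
}

coefs = event_study_gaps(panel, treated_units, adoption)
```

In this toy panel the leads are exactly zero and the lags trace the assumed ramp; on real data, leads that drift away from zero are the warning sign.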
Matching or weighting can make DiD more credible when treated and control units differ substantially. Propensity score matching, inverse probability weighting, or exact matching on geography, baseline activity, Prime status, seller segment, or pre-period outcome trends can reduce extrapolation, but they do not fix unobserved time-varying confounding.
Clustering standard errors matters because observations for the same customer, seller, product, or marketplace over time are correlated. Cluster at the treatment-assignment level when possible; for few clusters, use wild cluster bootstrap or aggregate to the cluster-time level before inference.
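One simple way to respect within-cluster correlation is to collapse each cluster to a single pre-to-post change and run inference on those cluster-level numbers. The sketch below does that with a two-sample standard error; `collapsed_did` and the marketplace-level changes are made up for illustration, and with this few clusters a t reference distribution or a wild cluster bootstrap would be more defensible than a normal approximation.

```python
from math import sqrt
from statistics import mean, variance


def collapsed_did(changes_treated, changes_control):
    """DiD and standard error from one (post - pre) change per cluster."""
    estimate = mean(changes_treated) - mean(changes_control)
    se = sqrt(
        variance(changes_treated) / len(changes_treated)
        + variance(changes_control) / len(changes_control)
    )
    return estimate, se


# Hypothetical post-minus-pre changes, one value per marketplace.
treated_changes = [4.0, 6.0, 5.0]
control_changes = [1.0, 3.0, 2.0]

estimate, se = collapsed_did(treated_changes, control_changes)
```

The effective sample size here is six clusters, regardless of how many customer-week rows sit underneath them.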
Spillovers and interference can break DiD. For example, treating some sellers with faster concessions may affect buyer expectations or competing sellers’ demand. If interference is plausible, define larger clusters, exclude exposed controls, or estimate market-level effects rather than individual-level effects.
Outcome choice should distinguish short-term behavioral movement from business value. For a reminder intervention, track open_rate, conversion, incremental orders, revenue, unsubscribes, and long-term retention; for concessions, track defect resolution, gift-card redemption, repeat purchase, concession cost, and customer trust metrics.
Robustness checks should be planned, not improvised: placebo treatment dates, placebo outcomes that should not move, alternative control groups, alternative time windows, covariate-adjusted specifications, excluding launch week, winsorizing outliers, and checking sensitivity by segment.
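A placebo-date check can be sketched by re-running the same 2x2 estimator on pre-period data only, with a fake start date. The `did` helper, the toy panel, and the chosen dates are assumptions for illustration; a placebo estimate far from zero would suggest diverging pre-trends.

```python
from statistics import mean


def did(panel, treated_units, start, periods):
    """2x2 DiD on {(unit, period): y}, restricted to `periods`; post = period >= start."""
    def cell(is_treated, is_post):
        return mean(
            y for (i, t), y in panel.items()
            if t in periods
            and (i in treated_units) == is_treated
            and (t >= start) == is_post
        )
    return (cell(True, True) - cell(True, False)) - (cell(False, True) - cell(False, False))


# Toy panel: true treatment starts in period 4 with an effect of 2.0.
alpha = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 5.0}
lam = {0: 0.0, 1: 0.5, 2: 1.5, 3: 1.0, 4: 2.0, 5: 2.5}
treated_units, true_start = {"a", "b"}, 4
panel = {
    (i, t): alpha[i] + lam[t] + (2.0 if i in treated_units and t >= true_start else 0.0)
    for i in alpha for t in lam
}

actual = did(panel, treated_units, start=true_start, periods=range(0, 6))
placebo = did(panel, treated_units, start=2, periods=range(0, 4))  # pre-period only
```

Restricting the placebo run to periods before the true start matters: including treated periods would contaminate the fake "post" window with the real effect.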

Worked example

For “Evaluate concession gift-card policy with DID”, a strong candidate would first clarify the treatment: “Is the policy rolled out by marketplace, customer segment, issue type, or time? What outcome is primary—concession cost, repeat purchase, customer satisfaction, or contact reduction?” They would also ask whether rollout timing was random, capacity-driven, or targeted toward high-defect areas, because targeting creates selection risk.

The answer skeleton should have four pillars. First, define the panel, such as marketplace-week or customer_issue_type-week, with treatment start date, outcomes, and pre-period covariates. Second, propose a DiD or event-study model with unit and calendar-time fixed effects, clustered standard errors, and explicit exclusion of contaminated units if the policy leaked. Third, diagnose assumptions using pre-trends, placebo dates, and balance on baseline concession rate, contact rate, order volume, and defect mix. Fourth, interpret both customer and cost metrics, because a policy may improve retention while increasing concession expense.

A concrete tradeoff to flag: a customer-level panel gives more statistical power and supports segment analysis, but if treatment was assigned at the marketplace or policy-rule level, clustering standard errors at the customer level rather than at the assignment level will likely overstate precision. The close should sound practical: “If I had more time, I would compare a modern staggered-adoption estimator against standard two-way fixed effects and run sensitivity by issue category, because gift-card effects may be large for delivery defects but small for product-quality complaints.”

A second angle

For “Design causal study for reminder impact”, the same framework applies, but treatment exposure is usually more behaviorally ambiguous. A reminder may be assigned, delivered, opened, clicked, or acted upon; the cleanest estimand might be the effect of being eligible for reminders, not the effect among openers, because openers are self-selected. The candidate should define pre-period engagement, seasonality, channel saturation, and opt-out behavior before choosing controls. If reminder rollout was staggered across cohorts, an event study can show whether treated users were already trending upward before the reminder. If the interviewer pushes on compliance, distinguish intent-to-treat from treatment-on-the-treated and explain why the latter needs stronger assumptions or an instrument.
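If the interviewer pushes on compliance, the relationship between the two estimands can be shown with a Wald-style rescaling: the intent-to-treat effect divided by the difference in take-up between eligible and ineligible users. This is only a sketch under strong assumptions, namely that eligibility affects outcomes solely through the reminder actually being received (exclusion) and that eligibility never reduces take-up (monotonicity). The `scaled_effect` helper and the numbers are hypothetical.

```python
def scaled_effect(itt, take_up_eligible, take_up_ineligible=0.0):
    """Rescale an intent-to-treat effect by the compliance gap.

    With no take-up among ineligible users, the result is interpretable as
    the effect on those who actually received the reminder.
    """
    compliance_gap = take_up_eligible - take_up_ineligible
    if compliance_gap <= 0:
        raise ValueError("eligibility must increase take-up")
    return itt / compliance_gap


# Hypothetical inputs: a DiD on eligibility gives +0.3 orders per user,
# and 60% of eligible users actually received and opened a reminder.
effect_on_receivers = scaled_effect(itt=0.3, take_up_eligible=0.6)
```

The rescaled number is larger than the intent-to-treat effect by construction, so it should be reported alongside it rather than in place of it.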

Common pitfalls

Pitfall: Treating similar-looking pre-trends as proof of parallel trends.

Similar-looking pre-trends increase credibility but do not prove identification, because the assumption concerns the unobserved post-period counterfactual. A better answer says: “I would use pre-trends, placebo outcomes, and rollout rationale to argue plausibility, while acknowledging that unobserved time-varying shocks, like a local promotion or policy change, could still bias the estimate.”

Pitfall: Jumping straight to the regression without defining the estimand.

A weak answer starts with $Y = \alpha + \beta D + \dots$ before saying what $\beta$ means. A stronger answer first states whether the goal is effect on treated customers, effect of rollout eligibility, effect of actual usage, or expected launch impact across all Amazon customers.

Pitfall: Using standard two-way fixed effects mechanically for staggered adoption.

Classic fixed effects can produce misleading weighted averages when treatment effects vary by cohort or time since treatment. Mentioning event-study leads/lags is good; naming the treatment-heterogeneity issue and proposing group-time or stacked estimators shows deeper applied econometrics judgment.

Connections

Interviewers may pivot from DiD into propensity score matching, synthetic control, instrumental variables, double machine learning, or randomized A/B testing as alternative causal strategies. They may also ask how you would operationalize the analysis in SQL or Python, but for a Data Scientist the focus remains on cohort construction, metric validity, assumptions, and interpretation rather than pipeline architecture.
