Apply Double ML with text-address features
Company: Amazon
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: HR Screen
Estimate the ATE of receiving a first reminder on CSAT using Double Machine Learning (DML), incorporating text from user addresses. Specify:
1) Outcome Y, treatment T, and feature set X, including how you represent addresses (e.g., geocoding + neighborhood attributes, and an address text embedding);
2) The orthogonalized moment you will use and how you implement sample-splitting and K-fold cross-fitting;
3) Choices of nuisance learners for E[Y|X] and E[T|X] (e.g., gradient boosting for Y, calibrated logistic for T), hyperparameters, and how you prevent leakage from post-treatment variables;
4) Diagnostics for overlap/positivity and how you would trim or reweight;
5) How you would test sensitivity to unobserved confounding (e.g., Oster δ or partial R²), and report subgroup effects (device, channel) while controlling FDR;
6) How you would validate text features (ablation tests, SHAP consistency across folds) and mitigate geographic privacy/fairness risks (e.g., excluding protected proxies, coarse geohashes).
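As a rough illustration of item 2, the sketch below cross-fits the AIPW (doubly robust) score, which is one common orthogonal score for the ATE with a binary treatment; the question itself does not say which score is intended. Everything here is an assumption made for illustration: the data are simulated, X is a single discrete cell (standing in for something like a coarse geohash bucket), and the nuisance learner is a per-cell mean rather than the gradient boosting or calibrated logistic models named in item 3. The function names `fit_cell_mean`, `dml_ate`, and `simulate` are hypothetical.

```python
import random
from statistics import fmean


def fit_cell_mean(pairs):
    """Toy nuisance learner: mean of v within each discrete cell x.
    Falls back to the overall training mean for cells unseen in training."""
    by_cell = {}
    for x, v in pairs:
        by_cell.setdefault(x, []).append(v)
    overall = fmean(v for _, v in pairs)
    means = {x: fmean(vs) for x, vs in by_cell.items()}
    return lambda x: means.get(x, overall)


def dml_ate(rows, k=5, clip=0.01, seed=0):
    """Cross-fitted AIPW estimate of the ATE.
    rows: list of (x, t, y) with t in {0, 1}. Returns (estimate, std. error)."""
    n = len(rows)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # K disjoint held-out folds
    scores = [0.0] * n
    for fold in folds:
        held = set(fold)
        train = [rows[i] for i in range(n) if i not in held]
        # Nuisances are fit on the other K-1 folds only (cross-fitting).
        m_hat = fit_cell_mean([(x, t) for x, t, _ in train])             # E[T|X]
        g1_hat = fit_cell_mean([(x, y) for x, t, y in train if t == 1])  # E[Y|X,T=1]
        g0_hat = fit_cell_mean([(x, y) for x, t, y in train if t == 0])  # E[Y|X,T=0]
        for i in fold:
            x, t, y = rows[i]
            m = min(max(m_hat(x), clip), 1 - clip)  # guard against division by ~0
            g1, g0 = g1_hat(x), g0_hat(x)
            # Orthogonal (AIPW) score evaluated on the held-out unit.
            scores[i] = g1 - g0 + t * (y - g1) / m - (1 - t) * (y - g0) / (1 - m)
    theta = fmean(scores)
    se = (fmean((s - theta) ** 2 for s in scores) / n) ** 0.5
    return theta, se


def simulate(n=4000, true_ate=2.0, seed=1):
    """Simulated data in which X confounds both treatment and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.randrange(5)
        t = 1 if rng.random() < 0.2 + 0.15 * x else 0
        y = true_ate * t + 0.5 * x + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows


rows = simulate()
theta_hat, se_hat = dml_ate(rows)
print(f"ATE estimate {theta_hat:.3f} (SE {se_hat:.3f})")
```

In a real analysis the per-cell means would be replaced by flexible learners fit on pre-treatment features only, with the fold loop unchanged; the key property the sketch tries to show is that each unit's score uses nuisance predictions from models that never saw that unit.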
Quick Answer: This question tests causal inference with Double Machine Learning for estimating an average treatment effect from observational data. It covers the conceptual side (orthogonalization, sample-splitting and cross-fitting, identification assumptions) and the practical side: representing and validating address-derived text and geospatial features, fitting nuisance models, checking overlap/positivity, running sensitivity analyses, reporting subgroup effects with multiple-comparison control, and handling geographic privacy and fairness concerns. It is categorized under Machine Learning and applied causal inference.
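Two of the practical checks named here, overlap trimming and multiple-comparison control for subgroup effects, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the trimming bounds (0.05, 0.95) and FDR level (0.10) are arbitrary choices, the propensities and subgroup p-values are made-up numbers, and the helper names `trim_by_propensity` and `benjamini_hochberg` are hypothetical. The FDR step uses the Benjamini-Hochberg procedure, one standard option.

```python
def trim_by_propensity(propensities, low=0.05, high=0.95):
    """Return indices of units whose estimated propensity lies in [low, high]."""
    return [i for i, p in enumerate(propensities) if low <= p <= high]


def benjamini_hochberg(p_values, q=0.10):
    """Benjamini-Hochberg step-up procedure.
    Returns a list of booleans, True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Track the largest rank whose p-value is under its BH threshold.
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


# Made-up estimated propensities for five units.
p_hat = [0.02, 0.30, 0.55, 0.97, 0.80]
keep = trim_by_propensity(p_hat)
print(keep)

# Made-up p-values for five hypothetical subgroup effects (device / channel).
subgroups = ["ios", "android", "web", "email", "sms"]
p_vals = [0.004, 0.030, 0.041, 0.20, 0.65]
rejected = benjamini_hochberg(p_vals, q=0.10)
print(dict(zip(subgroups, rejected)))
```

Note that trimming changes the estimand: the effect is then an average over the retained overlap population rather than the full sample, which is worth stating alongside any reported estimate.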