Estimate delayed CVR nonparametrically with censored data
Company: Meta
Role: Data Scientist
Category: Statistics & Math
Difficulty: Medium
Interview Round: Technical Screen
Today is 2025-09-01. We need the 14-day conversion rate (CVR14) for impressions served between 2025-08-18 and 2025-09-01. Conversions can arrive up to 14 days after an impression, with delays that are not known in advance, so outcomes for recent impressions are right-censored: an impression that has not converted yet may still convert within its 14-day window. You cannot assume any parametric delay distribution.
Tasks:
1) Propose a nonparametric estimator for CVR14 that uses historical cohorts to learn the time-to-convert survival function and applies it to the current, partially observed cohort (e.g., Kaplan–Meier for conversion delay with right-censoring, then inverse-probability weighting to debias the observed-to-date converts). Write formulas for the estimator and indicate the data each term uses.
2) Construct a 95% confidence interval using Greenwood’s formula for the KM variance and the delta method for the transformed CVR, stating assumptions. Explain how you would widen intervals if you suspect non-stationarity of delays.
3) Provide a distribution-free conservative bound for CVR14 that makes minimal assumptions (e.g., DKW inequality on the empirical CDF of delays or Clopper–Pearson on observed conversions plus a worst-case bound for yet-unfinished impressions). Show how to compute it from raw counts available today.
4) Describe diagnostics to check whether the historical delay distribution still applies to current traffic (e.g., test for covariate shift with PSI or KS tests on traffic mix, day-of-week effects, or device splits) and how you would stratify or reweight if shift is detected.
5) If you can observe only aggregated daily counts of impressions and same-day conversions (no user-level data), outline an approach under which CVR14 is identifiable, or at least boundable, and state the additional assumptions it requires.
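For tasks 1 and 2, a minimal pure-Python sketch may help fix the notation. It is one possible approach, not a reference solution. It assumes user-level rows of the form (duration, converted), where duration = min(conversion delay, impression age) in days, and it assumes independent censoring and a delay distribution that is stable between the historical and current cohorts. The function names km_curve, surv_at, cvr_with_ci, and ipw_cvr are hypothetical. The Kaplan-Meier survival is S(t) = product over event times t_j <= t of (1 - d_j / n_j), with d_j conversions at delay t_j and n_j impressions still at risk; the Greenwood sum is the sum of d_j / (n_j (n_j - d_j)); the interval uses the delta method on log(-log S). The last function is a ratio-form variant of the inverse-probability correction: observed converts divided by the summed share G(a) = F(a) / F(14) of eventual converters expected to have arrived by each impression's age a.

```python
import math


def km_curve(durations, converted):
    """Kaplan-Meier curve as a list of (event_time, survival, greenwood_sum).

    durations[i]: min(conversion delay, impression age), in days
    converted[i]: True if the conversion was observed, False if right-censored
    """
    data = sorted(zip(durations, converted))
    n_at_risk = len(data)
    surv, gw, curve, i = 1.0, 0.0, [], 0
    while i < len(data):
        t, j, d = data[i][0], i, 0
        # Group all rows that share the same observed time t.
        while j < len(data) and data[j][0] == t:
            d += data[j][1]  # True counts as one conversion
            j += 1
        if d > 0:
            surv *= 1.0 - d / n_at_risk
            gw = gw + d / (n_at_risk * (n_at_risk - d)) if n_at_risk > d else math.inf
            curve.append((t, surv, gw))
        n_at_risk -= j - i  # drop converts and censored rows at time t
        i = j
    return curve


def surv_at(curve, t):
    """Step-function lookup: (S(t), Greenwood sum) at delay t."""
    s, gw = 1.0, 0.0
    for time, surv, g in curve:
        if time > t:
            break
        s, gw = surv, g
    return s, gw


def cvr_with_ci(curve, horizon=14.0, z=1.96):
    """CVR = 1 - S(horizon) with a log(-log S) delta-method interval."""
    s, gw = surv_at(curve, horizon)
    cvr = 1.0 - s
    if not (0.0 < s < 1.0) or not math.isfinite(gw):
        return cvr, None  # interval is undefined at the boundary
    se = math.sqrt(gw) / abs(math.log(s))
    s_lo, s_hi = s ** math.exp(z * se), s ** math.exp(-z * se)
    return cvr, (1.0 - s_hi, 1.0 - s_lo)


def ipw_cvr(ages, converted_so_far, curve, horizon=14.0):
    """Ratio-form correction: observed converts / expected share already arrived."""
    f_h = 1.0 - surv_at(curve, horizon)[0]
    if f_h <= 0.0:
        raise ValueError("historical curve has no conversions before the horizon")
    # G(a) = F(min(a, horizon)) / F(horizon)
    shares = [(1.0 - surv_at(curve, min(a, horizon))[0]) / f_h for a in ages]
    if sum(shares) <= 0.0:
        raise ValueError("current cohort is too young to correct")
    return sum(converted_so_far) / sum(shares)
```

A typical use would fit km_curve on historical cohorts (optionally pooled with the current one), read CVR14 and its interval from cvr_with_ci, and apply ipw_cvr to the current cohort's ages and conversions to date. If delays may be non-stationary, one option is to fit the curve separately on several historical windows and report the envelope of the resulting intervals.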
Quick Answer: This question evaluates competence in handling right-censored time-to-event data, nonparametric estimation and inference for delayed conversions, construction of confidence intervals and distribution-free bounds, diagnostic checks for nonstationarity, and reasoning about identifiability under aggregated data.
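For task 3, the worst-case bound can be sketched from three raw counts available today. This sketch swaps a Hoeffding (DKW-style) margin in for Clopper-Pearson because it needs only the standard library; it is looser than the exact binomial interval. It assumes impressions are independent with a common 14-day conversion probability, and that every conversion counted so far arrived within 14 days of its impression. The function name and arguments are hypothetical.

```python
import math


def conservative_cvr14_bounds(n_impressions, n_converted, n_unresolved, alpha=0.05):
    """Distribution-free band for CVR14 from raw counts.

    n_impressions: impressions in the cohort
    n_converted:   conversions observed so far (delay <= 14 days)
    n_unresolved:  unconverted impressions younger than 14 days
    """
    if n_impressions <= 0 or n_converted + n_unresolved > n_impressions:
        raise ValueError("counts are inconsistent")
    # Hoeffding margin: P(|rate - p| >= eps) <= 2 * exp(-2 * n * eps**2) = alpha
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n_impressions))
    # Worst cases for the final 14-day rate of this cohort:
    low_rate = n_converted / n_impressions  # no unresolved impression converts
    high_rate = (n_converted + n_unresolved) / n_impressions  # all of them convert
    return max(0.0, low_rate - eps), min(1.0, high_rate + eps)
```

The band is valid without any model of the delay distribution, because the cohort's final 14-day rate must lie between the two worst cases. The price is width: when most impressions are still unresolved, the upper bound is close to uninformative.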