Choose robust metrics for skewed comments
Company: Meta
Role: Data Scientist
Category: Statistics & Math
Difficulty: hard
Interview Round: Onsite
A website’s per-user daily comment counts are extremely skewed and zero-inflated. You roll out a backend optimization expected to increase engagement.
(a) Explain when the mean, median, 10% trimmed mean, winsorized mean (clipped at the 5th and 95th percentiles), and the geometric mean of (1 + count) minus 1, i.e. exp(mean(log(1 + count))) − 1, are preferable estimators of central tendency for such data. Discuss bias/variance trade-offs under heavy tails (e.g., Pareto) and interpretability for product decisions.
(b) Suppose the control group’s per-user counts for a day are [0,0,0,1,1,2,2,3,20,50] and treatment’s are [0,0,1,1,1,2,2,3,5,10]. Compute mean, median, 10% trimmed mean, and winsorized mean for each, and determine which estimator most reliably detects a practically meaningful improvement here. Justify rigorously.
(c) Describe how you would form a 95% CI for your chosen estimator using nonparametric bootstrap with stratification by user activity buckets. State assumptions and how you’d check them.
(d) If you must report an effect size that’s robust but comparable across experiments, propose a transformation and effect metric (e.g., log1p-based percent change or quantile treatment effect at τ=0.8) and defend its choice.
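The estimators named in (a) and (b) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference solution. Two conventions are assumptions rather than part of the question: the 10% trimmed mean drops floor(0.10 * n) observations from each tail, and the winsorized mean clips at linearly interpolated 5th and 95th percentiles. Other winsorizing conventions (for example, replacing a fixed number of order statistics) give different numbers on a sample this small. The helper names are hypothetical.

```python
import math
import statistics


def percentile(sorted_xs, q):
    """Linearly interpolated quantile (q in [0, 1]) of an already-sorted list."""
    pos = q * (len(sorted_xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])


def trimmed_mean(xs, prop=0.10):
    """Assumed convention: drop floor(prop * n) values from each tail."""
    s = sorted(xs)
    k = int(prop * len(s))
    return statistics.fmean(s[k:len(s) - k])


def winsorized_mean(xs, lower=0.05, upper=0.95):
    """Assumed convention: clip to interpolated lower/upper percentiles."""
    s = sorted(xs)
    lo, hi = percentile(s, lower), percentile(s, upper)
    return statistics.fmean([min(max(x, lo), hi) for x in s])


def shifted_geometric_mean(xs):
    """Geometric mean of (1 + x), minus 1: exp(mean(log1p(x))) - 1."""
    return math.expm1(statistics.fmean([math.log1p(x) for x in xs]))


def summarize(xs):
    return {
        "mean": statistics.fmean(xs),
        "median": statistics.median(xs),
        "trimmed_10": trimmed_mean(xs),
        "winsorized_5_95": winsorized_mean(xs),
        "geo_mean_1p": shifted_geometric_mean(xs),
    }


# Per-user counts from part (b).
control = [0, 0, 0, 1, 1, 2, 2, 3, 20, 50]
treatment = [0, 0, 1, 1, 1, 2, 2, 3, 5, 10]

for name, xs in (("control", control), ("treatment", treatment)):
    print(name, {k: round(v, 4) for k, v in summarize(xs).items()})
```

Under these conventions the control and treatment values are 7.9 and 2.5 for the mean, 1.5 and 1.5 for the median, 3.625 and 1.875 for the trimmed mean, and 6.55 and 2.275 for the winsorized mean. The shifted geometric mean comes out at roughly 2.3 versus 1.7. The gaps are driven largely by the two large control observations (20 and 50), which is the sensitivity that parts (a) and (b) ask you to reason about.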
Quick Answer: This question evaluates understanding of robust estimation and inference for zero‑inflated, heavy‑tailed count data, including central tendency choices (mean, median, trimmed and winsorized means, geometric mean), nonparametric bootstrap confidence intervals, and robust effect‑size transformations.
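For parts (c) and (d), one possible sketch is a stratified percentile bootstrap around a log1p-based percent change. Everything beyond the raw counts is an assumption made for illustration: the "light"/"heavy" bucket assignment is invented (in practice buckets would come from pre-experiment activity, not from the outcome), the function names are hypothetical, and the percentile interval is only one of several bootstrap interval types. It also assumes users are independent and that bucket sizes are fixed by design.

```python
import math
import random
import statistics


def log1p_percent_change(control, treatment):
    """Percent change in the geometric mean of (1 + count), treatment vs control."""
    delta = (statistics.fmean([math.log1p(x) for x in treatment])
             - statistics.fmean([math.log1p(x) for x in control]))
    return 100.0 * math.expm1(delta)


def resample_stratified(strata, rng):
    """Resample with replacement inside each bucket, keeping bucket sizes fixed."""
    pooled = []
    for counts in strata.values():
        pooled.extend(rng.choices(counts, k=len(counts)))
    return pooled


def bootstrap_ci(control_strata, treatment_strata, effect,
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for effect(control, treatment)."""
    rng = random.Random(seed)
    draws = sorted(
        effect(resample_stratified(control_strata, rng),
               resample_stratified(treatment_strata, rng))
        for _ in range(n_boot)
    )
    lo = draws[int((alpha / 2) * (n_boot - 1))]
    hi = draws[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


# Hypothetical activity buckets; the question does not specify them.
control_strata = {"light": [0, 0, 0, 1, 1, 2], "heavy": [2, 3, 20, 50]}
treatment_strata = {"light": [0, 0, 1, 1, 1, 2], "heavy": [2, 3, 5, 10]}

control = [x for xs in control_strata.values() for x in xs]
treatment = [x for xs in treatment_strata.values() for x in xs]

point = log1p_percent_change(control, treatment)
ci_low, ci_high = bootstrap_ci(control_strata, treatment_strata,
                               log1p_percent_change)
print(f"log1p percent change: {point:.1f}% (95% CI {ci_low:.1f}% to {ci_high:.1f}%)")
```

The `effect` argument can be swapped for another statistic, such as a difference in 0.8 quantiles, without changing the resampling logic. With only ten users per arm the interval is wide and the percentile method is rough, so this shows the mechanics rather than a usable inference.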