Skewed And Heavy-Tailed Distributions — Tech Interview Concept

What's being tested
Ability to recognize non-Gaussian, skewed/tail-heavy measurement behavior; choose appropriate summary statistics, tests, sampling, and models so inference and systems remain valid under heavy tails.

Core knowledge

Heavy-tailed: tail P(X>x) ~ C x^{-α}; α≤2 implies infinite variance, α≤1 implies infinite mean.
Common families: Pareto (power-law), log-normal (heavier than exponential, but all moments finite), Weibull (heavy-tailed for shape k<1, light-tailed for k≥1).
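Visualization: CCDF on log-log axes (an approximately straight tail suggests a power law, though a log-normal can look similar over a limited range); QQ-plot against a reference distribution for tail deviations.
Tail estimation: Hill estimator for α (the Pareto MLE applied above a threshold x_min); estimates are sensitive to the choice of x_min.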
Robust summaries: median, trimmed mean, winsorization, median-of-means, Catoni/Huber estimators.
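Inference caveats: the classical CLT fails for α<2 (infinite variance; sums converge to a stable law, not a Gaussian), and t-tests converge slowly when α is only slightly above 2; use permutation tests, robust estimators, or a bootstrap of robust statistics (the naive bootstrap of the mean is itself inconsistent under infinite variance).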
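Sampling/systems: a few heavy contributors dominate totals, so simple random samples give high-variance estimates; use stratified or weighted sampling (e.g., weighted reservoir sampling on streams), or track heavy hitters separately.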
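
As a rough illustration of the tail-estimation and robust-summary items above, the sketch below implements a basic Hill estimator and a median-of-means estimate with NumPy. The function names, the choice k=2000, the 50 blocks, and the synthetic Pareto sample with α=1.5 are all assumptions made for the example, not recommendations.

```python
import numpy as np

def hill_estimator(x, k):
    """Hill estimate of the tail index alpha using the k largest observations."""
    x = np.sort(np.asarray(x, dtype=float))
    if not 0 < k < len(x):
        raise ValueError("k must satisfy 0 < k < len(x)")
    top = x[-k:]            # the k largest values
    threshold = x[-k - 1]   # (k+1)-th largest value, playing the role of x_min
    if threshold <= 0:
        raise ValueError("threshold must be positive to take logs")
    return k / np.sum(np.log(top / threshold))

def median_of_means(x, n_blocks, rng):
    """Shuffle, split into blocks, average each block, take the median of the averages."""
    shuffled = rng.permutation(np.asarray(x, dtype=float))
    blocks = np.array_split(shuffled, n_blocks)
    return float(np.median([block.mean() for block in blocks]))

# Synthetic data (assumption): classical Pareto with alpha = 1.5 and x_min = 1.
# NumPy's pareto() draws from the Lomax form, so adding 1 shifts it to x_min = 1.
rng = np.random.default_rng(0)
sample = rng.pareto(1.5, size=100_000) + 1.0

alpha_hat = hill_estimator(sample, k=2_000)
mom = median_of_means(sample, n_blocks=50, rng=rng)
print(f"Hill alpha estimate: {alpha_hat:.2f}")
print(f"sample mean: {sample.mean():.2f}, median-of-means: {mom:.2f}")
```

On real data the Hill estimate should be checked for stability across a range of k before it is trusted, since too small a k is noisy and too large a k pulls in the non-tail bulk.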

Worked example — "How would you handle a heavy-tailed metric in an A/B test?"
First, frame the problem: define the metric, its expected tail behavior, and the business tolerance for tail-driven effects (e.g., revenue spikes). Next, inspect the empirical CCDF and compute a Hill estimate of the tail index α. If α is comfortably above 2, mean-based tests are reasonable given a large sample; if α∈(1,2], prefer robust estimators (median-of-means or trimmed mean), or model the tail separately with a Pareto fit and test for differences in the bulk and the tail. If α≤1 the mean does not exist, so compare quantiles instead. Finally, choose the inference method: a nonparametric permutation test for medians, a stratified bootstrap, or EVT-based confidence intervals for tail quantities.
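
One way the robust-estimator branch of this answer might look in code is a two-sided permutation test on the difference in trimmed means. This is a minimal sketch: the helper names, the 5% trimming fraction, the 1,000 permutations, and the simulated arms (Pareto with α=1.5 and a hypothetical 20% multiplicative lift) are assumptions for illustration only.

```python
import numpy as np

def trimmed_mean(x, cut=0.05):
    """Mean after dropping the lowest and highest `cut` fraction of values."""
    x = np.sort(np.asarray(x, dtype=float))
    g = int(cut * len(x))
    return float(x[g:len(x) - g].mean())

def permutation_test(a, b, stat, n_perm=1_000, seed=0):
    """Two-sided permutation test for stat(b) - stat(a); returns (observed, p_value)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = stat(b) - stat(a)
    pooled = np.concatenate([a, b])
    n_a = len(a)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)  # reassign arm labels at random
        diff = stat(perm[n_a:]) - stat(perm[:n_a])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Simulated arms (assumption): heavy-tailed revenue-like metric, alpha = 1.5.
rng = np.random.default_rng(42)
control = rng.pareto(1.5, size=5_000) + 1.0
treatment = 1.2 * (rng.pareto(1.5, size=5_000) + 1.0)  # hypothetical 20% lift

lift, p_lift = permutation_test(control, treatment, trimmed_mean)
print(f"trimmed-mean difference: {lift:.3f}, p-value: {p_lift:.4f}")
```

Note that the trimmed mean estimates a different quantity from the raw mean, so the trimming fraction should be fixed before looking at the results and reported alongside them.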

A common pitfall
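The tempting quick fix is to log-transform and run a standard t-test. The log is undefined for zero or negative values, so those observations get silently dropped or shifted by an arbitrary constant; it changes the effect being tested from a difference in means to a ratio of geometric means; and it only yields a Gaussian if the data are log-normal (the log of a Pareto variable is exponential, not normal). Equally dangerous is trimming at an arbitrary, unjustified threshold, which removes business-relevant extremes and biases results. Always justify transformations and thresholds, and report both bulk and tail analyses.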
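
The log-transform pitfall can be checked numerically. The sketch below, which uses simulated data and a hypothetical five-value revenue array (both assumptions), compares the skewness of log-transformed Pareto and log-normal samples and shows how many observations a log transform would discard when zeros are present.

```python
import numpy as np

def sample_skewness(x):
    """Standardized third central moment (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(1)
pareto = rng.pareto(1.5, size=200_000) + 1.0               # Pareto, x_min = 1
lognormal = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)

# log(Pareto) is exponential (theoretical skewness 2), so it stays skewed;
# log(log-normal) is normal (theoretical skewness 0).
skew_log_pareto = sample_skewness(np.log(pareto))
skew_log_lognormal = sample_skewness(np.log(lognormal))
print(f"skewness of log(Pareto):     {skew_log_pareto:.2f}")
print(f"skewness of log(log-normal): {skew_log_lognormal:.2f}")

# Zeros cannot be log-transformed; filtering them changes the population studied.
revenue = np.array([0.0, 0.0, 3.0, 12.0, 450.0])  # hypothetical per-user revenue
dropped_share = float(np.mean(revenue <= 0))
print(f"share of users dropped by a log transform: {dropped_share:.0%}")
```

A strongly skewed log-scale sample is a sign that the data are not log-normal and that a t-test on logs does not have the Gaussian justification it appears to have.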

Further reading

M. Mitzenmacher, "A Brief History of Generative Models for Power Law and Lognormal Distributions" (2004).
S. Resnick, "Heavy-Tail Phenomena: Probabilistic and Statistical Modeling" (2007).

Related concepts