Skewed And Heavy-Tailed Distributions
Asked of: Data Scientist

What's being tested
Ability to recognize non-Gaussian, skewed, or heavy-tailed metrics, and to choose summary statistics, tests, sampling schemes, and models that keep inference and systems valid under heavy tails.
Core knowledge
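- Heavy-tailed (power-law form): P(X>x) ~ C x^{-α} as x→∞; α≤2 implies infinite variance, α≤1 implies infinite mean.
- Common families: Pareto (power-law), log-normal (heavy-tailed, but all moments are finite), Weibull (heavy-tailed for shape k<1, light-tailed for k≥1).
- Visualization: CCDF on log-log axes (an approximately straight tail suggests a power law, although a log-normal can also look straight over a limited range); QQ-plot against a reference distribution to expose tail deviations.
- Tail estimation: Hill estimator for α, MLE for Pareto above threshold x_min; both are sensitive to the threshold choice, so check that the estimate is stable across a range of thresholds.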
- Robust summaries: median, trimmed mean, winsorization, median-of-means, Catoni/Huber estimators.
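- Inference caveats: for α≤2 the variance is infinite, so the classical CLT (and hence the t-test) is not justified; for α slightly above 2 it holds but converges slowly. The naive bootstrap of the mean is also unreliable under infinite variance; prefer permutation tests, a bootstrap on robust statistics (or an m-out-of-n bootstrap), or robust estimators.
- Sampling/systems: a few rare, very large contributors dominate totals, so simple uniform sampling gives unstable, high-variance estimates; use stratified or weighted reservoir sampling, or track heavy hitters separately.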
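The tail-estimation and robust-summary items above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names `hill_estimator` and `median_of_means`, the synthetic Pareto sample, and the choices `k=2_000` and `n_blocks=50` are all assumptions made for the example.

```python
import numpy as np


def hill_estimator(x, k):
    """Hill estimate of the tail index alpha from the k largest values."""
    x = np.asarray(x, dtype=float)
    if not 1 <= k < x.size:
        raise ValueError("k must satisfy 1 <= k < len(x)")
    desc = np.sort(x)[::-1]              # largest first
    top, threshold = desc[:k], desc[k]   # threshold is the (k+1)-th largest
    if threshold <= 0:
        raise ValueError("the threshold must be positive to take logs")
    return k / np.sum(np.log(top / threshold))


def median_of_means(x, n_blocks, rng):
    """Median of block means: a mean estimate that resists extreme values."""
    shuffled = rng.permutation(np.asarray(x, dtype=float))
    blocks = np.array_split(shuffled, n_blocks)
    return float(np.median([block.mean() for block in blocks]))


rng = np.random.default_rng(0)
alpha_true = 1.5  # infinite variance, finite mean alpha / (alpha - 1) = 3
# numpy's pareto() draws from the Lomax form; adding 1 gives a classical
# Pareto with x_min = 1 (synthetic data, assumed for illustration).
sample = rng.pareto(alpha_true, size=50_000) + 1.0

alpha_hat = hill_estimator(sample, k=2_000)
mom = median_of_means(sample, n_blocks=50, rng=rng)
print(f"Hill alpha_hat = {alpha_hat:.2f}")
print(f"mean = {sample.mean():.2f}, median = {np.median(sample):.2f}, "
      f"median-of-means = {mom:.2f}")
```

In practice it is worth recomputing `hill_estimator` over a range of `k` values and checking that the estimate is roughly flat before trusting any single number.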
Worked example — "How would you handle a heavy-tailed metric in an A/B test?"
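First, frame the problem: define the metric, its expected tail behavior, and the business tolerance for tail-driven effects (e.g., revenue spikes). Next, inspect the empirical CCDF and compute a Hill estimate of the tail index α. If α is comfortably above 2, mean-based tests are reasonable given enough data, though convergence is still slow when α is close to 2. If α∈(1,2], the variance is infinite, so prefer robust estimators (median-of-means or trimmed mean), or model the tail separately with a Pareto fit and test differences in the bulk and in the tail. If α≤1, even the mean is undefined, so compare quantiles instead. Finally, choose the inference method: a nonparametric permutation test for medians or trimmed means, a stratified bootstrap on a robust statistic, or EVT-based confidence intervals for tail quantities.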
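One way to make the inference step concrete is a permutation test on a trimmed mean. The sketch below is illustrative only: the data are simulated (Pareto-tailed with an assumed 10% multiplicative uplift), and the helper names `trimmed_mean` and `permutation_test`, the 5% trim level, and the 500 permutations are hypothetical choices rather than recommendations.

```python
import numpy as np


def trimmed_mean(x, trim=0.05):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    x = np.sort(np.asarray(x, dtype=float))
    cut = int(trim * x.size)
    return float(x[cut:x.size - cut].mean())


def permutation_test(control, treatment, stat, n_perm, rng):
    """Two-sided permutation p-value for stat(treatment) - stat(control)."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    observed = stat(treatment) - stat(control)
    pooled = np.concatenate([control, treatment])
    n_control = control.size
    extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)   # reshuffle the arm labels
        diff = stat(perm[n_control:]) - stat(perm[:n_control])
        extreme += abs(diff) >= abs(observed)
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


rng = np.random.default_rng(1)
# Hypothetical data: Pareto-tailed revenue with a 10% multiplicative uplift.
control = rng.pareto(1.8, size=5_000) + 1.0
treatment = 1.10 * (rng.pareto(1.8, size=5_000) + 1.0)

observed, p_value = permutation_test(
    control, treatment, trimmed_mean, n_perm=500, rng=rng
)
print(f"trimmed-mean difference = {observed:.3f}, p = {p_value:.4f}")
```

Note that the trimmed mean answers a question about the bulk of the distribution; if the business effect lives in the extremes, a separate tail analysis is still needed.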
A common pitfall
The tempting quick fix is to log-transform and run standard t-tests. A log transform is undefined for zero or negative values (and ad hoc fixes such as log(x+1) are arbitrary), it changes the estimand from the arithmetic mean to the geometric mean, and it does not settle whether the tail is log-normal or power-law. Equally dangerous is trimming at an arbitrary, unjustified threshold: this removes business-relevant extremes and biases results. Always justify transformations and thresholds, and report both bulk and tail analyses.
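The change in effect interpretation can be seen in a small simulation. The two log-normal arms below are hypothetical, with parameters chosen so that the mean of the logs and the arithmetic mean point in opposite directions; for a log-normal, E[X] = exp(μ + σ²/2), which gives about 1.65 for arm A and about 1.38 for arm B.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical arms: B has a higher log-scale location but a smaller spread.
arm_a = rng.lognormal(mean=0.0, sigma=1.0, size=n)
arm_b = rng.lognormal(mean=0.2, sigma=0.5, size=n)

mean_a, mean_b = arm_a.mean(), arm_b.mean()
log_mean_a, log_mean_b = np.log(arm_a).mean(), np.log(arm_b).mean()

# A t-test on log values compares log_mean_*, i.e. geometric means,
# which here favors B, while the arithmetic mean (revenue per user) favors A.
print(f"arithmetic means: A = {mean_a:.3f}, B = {mean_b:.3f}")
print(f"means of logs:    A = {log_mean_a:.3f}, B = {log_mean_b:.3f}")
```

If the decision metric is total or average revenue, the arithmetic mean is the relevant estimand, so a significant result on the log scale does not by itself support shipping.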
Further reading
- M. Mitzenmacher, "A Brief History of Generative Models for Power Law and Lognormal Distributions" (2004).
- S. Resnick, "Heavy-Tail Phenomena: Probabilistic and Statistical Modeling" (2007).
Related concepts
- Heavy-Tailed and Zero-Inflated Distribution Analysis
- Skewed Distributions, Count Data, And Ratio Metrics
- Skewed Distributions And Count Data
- Central Limit Theorem, Sampling, And Heavy-Tailed Metrics
- Distributional Analysis And Robust Statistics
- Probability Modeling, Expectation, And Variance