Skewed Distributions, Count Data, and Ratio Metrics

What's being tested

Interviewers are probing whether you can analyze product metrics that are not approximately normal: comments per post, reactions per creator, messages per user, clicks per impression, or comments per viewer. At Meta, many engagement metrics are zero-inflated, heavy-tailed, and dominated by a small number of users, posts, or creators, so naive averages can be misleading even when computed correctly. The real skill is choosing summaries, uncertainty estimates, and tests that match the business question: “typical experience,” “total ecosystem activity,” “creator concentration,” or “treatment effect.” They are not testing memorized definitions of mean and median; they are testing whether you can reason about metric behavior under skew, outliers, sampling, and ratio denominators.

Core knowledge

Count engagement data often has many zeros and a long right tail: e.g., most posts get 0–2 comments, while a viral post gets 100K. Always inspect $P(X=0)$, mean, median, upper quantiles, max, and share of total from top entities.
Mean and median answer different product questions. The mean $\bar{x}=\frac{1}{n}\sum_i x_i$ measures total volume per unit and is sensitive to virality; the median measures typical experience and may be 0 for zero-inflated data, making it stable but sometimes uninformative.
Robust alternatives include trimmed means, winsorized means, and quantiles. A 5% trimmed mean drops the lowest and highest 5% of observations before averaging; a winsorized mean caps values at chosen percentiles instead of dropping them. Both reduce outlier leverage but can hide legitimate viral engagement if total ecosystem impact matters.
For exact quantiles, sort values: $O(n\log n)$ time and $O(n)$ memory, practical for millions of rows if the data fits in memory. At warehouse scale, use approximate algorithms such as t-digest, KLL sketches, or Presto/Spark approx_percentile, trading small rank error for huge memory savings.
Percentiles are more interpretable than variance for skewed data. Report p50, p75, p90, p95, p99, and sometimes p99.9. For counts, many adjacent quantiles may be identical because the distribution is discrete; say so rather than over-interpreting tiny percentile differences.
Sampling variability differs by statistic. For the mean, $\mathrm{SE}(\bar{x})=s/\sqrt{n}$ and the CLT may work with large $n$, but convergence is slow under heavy tails. For medians or p95, use nonparametric bootstrap or quantile asymptotics rather than assuming normality blindly.
Bootstrap procedure: resample rows with replacement, recompute the statistic, repeat 1,000–10,000 times, and use percentile or BCa intervals. It works well for medians, trimmed means, and ratios, but can be unstable for extreme p99.9 metrics if rare tail events are undersampled.
Count models provide diagnostic structure. Poisson assumes $\mathbb{E}[X]=\mathrm{Var}(X)$; social engagement usually has overdispersion, so negative binomial is often better. Zero-inflated Poisson or hurdle models separate “whether any comments happened” from “how many given positive comments.”
For concentration, use top-share, Lorenz curves, Gini coefficient, or Herfindahl-Hirschman Index: $HHI=\sum_i s_i^2$ where $s_i$ is entity $i$’s share of total comments. These identify whether engagement is broad-based or dominated by a few viral posts or creators.
Ratio metrics require denominator discipline. “Comments per viewer” as $\frac{\sum_i \text{comments}_i}{\sum_i \text{viewers}_i}$ is not the same as average per-post rate $\frac{1}{n}\sum_i \frac{\text{comments}_i}{\text{viewers}_i}$. The former weights by exposure; the latter weights each post equally and explodes when denominators are small.
For A/B tests on skewed metrics, prefer user-level randomization and user-level aggregates to avoid correlated observations. Estimate treatment effects on means with robust/sandwich SEs, the bootstrap, or randomization inference; for ratio metrics, use the delta method, linearization, or a cluster bootstrap that resamples the randomized unit.
Edge cases matter: zero denominators, bot/spam bursts, deleted content, logging delays, and duplicate events can dominate tail metrics. Define inclusion rules before analysis: time window, unit of analysis, spam filtering, whether to cap extreme values, and whether caps are for reporting or decision-making.
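
The descriptive checks listed above (zero rate, mean versus median, upper quantiles, top-share, HHI, and overdispersion) can be sketched in a few lines of NumPy. This is a minimal illustration, not a prescribed implementation: the function name profile_counts, the top_frac parameter, and the synthetic negative binomial data are all assumptions made for the example.

```python
import numpy as np

def profile_counts(x, top_frac=0.01):
    """Summarize non-negative counts, one entry per entity (post, user, creator).

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    total = x.sum()
    mean = x.mean()
    # Size of the "top" group used for the concentration share; at least one entity.
    k = max(1, int(np.ceil(top_frac * n)))
    top_total = np.sort(x)[-k:].sum()
    # s_i = entity i's share of the total, as in the HHI formula.
    shares = x / total if total > 0 else np.zeros(n)
    return {
        "n": n,
        "zero_rate": float(np.mean(x == 0)),
        "mean": float(mean),
        "median": float(np.median(x)),
        "p75": float(np.quantile(x, 0.75)),
        "p90": float(np.quantile(x, 0.90)),
        "p95": float(np.quantile(x, 0.95)),
        "p99": float(np.quantile(x, 0.99)),
        "max": float(x.max()),
        "top_share": float(top_total / total) if total > 0 else 0.0,
        "hhi": float(np.sum(shares ** 2)),
        # Variance-to-mean ratio: near 1 under Poisson, well above 1 under overdispersion.
        "dispersion": float(x.var(ddof=1) / mean) if mean > 0 else float("nan"),
    }

# Synthetic stand-in for comments per post (assumed parameters, not real data).
rng = np.random.default_rng(0)
comments = rng.negative_binomial(n=0.3, p=0.05, size=100_000)
summary = profile_counts(comments)
for name, value in summary.items():
    print(f"{name:>10}: {value}")
```

A dispersion value far above 1 is a quick signal that a plain Poisson model is a poor fit, and a large top_share alongside a near-zero median is the pattern where mean and median tell different product stories.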

Worked example

For a prompt such as “Choose robust metrics for skewed comments,” a strong candidate would start by clarifying the unit of analysis: comments per post, per article, per viewer, per user, or per session. They would ask whether the business goal is to understand the typical content experience, total comment volume, creator health, or tail risk from viral posts, because each goal implies a different metric. A clean answer could be organized around four pillars: first, characterize the distribution with zero rate, mean, median, p90/p95/p99, and max; second, choose summaries aligned to the product question; third, quantify uncertainty with bootstrap or robust standard errors; fourth, validate results against spam, bot activity, and logging anomalies.

They should explicitly say that the median may be 0 if most articles receive no comments, so reporting only the median could be technically robust but practically useless. A better reporting set might include mean for total engagement, median or p75 for typical experience, p95/p99 for tail behavior, and top-1% share for concentration. The main tradeoff is whether to cap or winsorize extreme articles: capping improves stability and comparability, but it may remove exactly the viral behavior the business cares about. If comparing two surfaces or experiments, they should recommend computing the metric at the randomized unit, not treating every comment as independent. They could close by saying: “If I had more time, I’d fit a hurdle or negative binomial model to separate the probability of receiving any comments from the intensity among articles that do receive comments.”
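
The "quantify uncertainty" pillar can be illustrated with a percentile bootstrap over articles. The sketch below assumes one row per article, independent rows, and synthetic negative binomial counts; bootstrap_ci and trimmed_mean are hypothetical helper names, and a BCa interval or a cluster bootstrap would need more than is shown here.

```python
import numpy as np

def bootstrap_ci(x, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(x); returns (estimate, lower, upper)."""
    x = np.asarray(x)
    rng = np.random.default_rng(seed)
    n = x.size
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample rows with replacement, then recompute the statistic.
        resample = x[rng.integers(0, n, size=n)]
        estimates[b] = stat(resample)
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(stat(x)), float(lower), float(upper)

def trimmed_mean(x, frac=0.05):
    """Mean after dropping the lowest and highest `frac` of observations."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.floor(frac * x.size))
    return x[k : x.size - k].mean()

# Synthetic comments-per-article data (assumed parameters, not real data).
rng = np.random.default_rng(1)
articles = rng.negative_binomial(n=0.3, p=0.05, size=5_000)

mean_ci = bootstrap_ci(articles, np.mean)
p75_ci = bootstrap_ci(articles, lambda s: np.quantile(s, 0.75))
trim_ci = bootstrap_ci(articles, trimmed_mean)

for label, (est, lo, hi) in [("mean", mean_ci), ("p75", p75_ci), ("5% trimmed mean", trim_ci)]:
    print(f"{label}: {est:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

On right-skewed data like this, the trimmed mean sits well below the raw mean, which makes the capping tradeoff concrete: the trimmed figure is steadier but no longer reflects the viral articles.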

A second angle

For a prompt such as “Characterize metric distribution and quantiles,” the same concepts appear, but the interviewer is more focused on descriptive rigor and computational practicality. Instead of immediately choosing a robust business metric, the candidate should explain how they would compute and interpret empirical quantiles, decide between exact and approximate methods, and communicate distribution shape. The right answer would mention that p50, p90, and p99 can tell very different stories from the mean, especially if a small number of entities account for most activity. If the data has billions of rows, exact sorting may be expensive, so approximate quantile sketches like t-digest or KLL are appropriate, with a note about rank error and validation on a smaller exact sample. The framing shifts from “which metric should we use?” to “how do we faithfully summarize and compute the distribution?”
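
One way to make "rank error" concrete is to measure how far an approximate quantile value lands from its target rank in exactly sorted data. The sketch below is illustrative only: rank_error is a hypothetical helper, the data is synthetic, and a quantile taken from a random subsample stands in for the output of a real sketch such as t-digest or KLL, which is not implemented here.

```python
import numpy as np

def rank_error(sorted_values, value, q):
    """Distance between target rank q and the rank interval that `value` occupies.

    For discrete data a value covers the interval [P(X < value), P(X <= value)];
    any q inside that interval counts as zero error.
    """
    n = sorted_values.size
    below = np.searchsorted(sorted_values, value, side="left") / n
    at_or_below = np.searchsorted(sorted_values, value, side="right") / n
    if below <= q <= at_or_below:
        return 0.0
    return float(min(abs(q - below), abs(q - at_or_below)))

# Synthetic counts, sorted once so exact ranks are cheap to look up.
rng = np.random.default_rng(2)
counts = np.sort(rng.negative_binomial(n=0.3, p=0.05, size=200_000))

# Stand-in for an approximate method: quantiles of a small random subsample.
subsample = rng.choice(counts, size=2_000, replace=False)

errors = {}
for q in (0.50, 0.90, 0.99):
    approx_value = np.quantile(subsample, q)
    errors[q] = rank_error(counts, approx_value, q)
    print(f"p{q * 100:g}: approx value {approx_value:.1f}, rank error {errors[q]:.4f}")
```

Because counts are discrete, several nearby target ranks can map to the same value, so a rank error of zero at p50 does not mean the approximation is equally good at p99.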

Common pitfalls

Analytical mistake: treating skewed counts as normal because $n$ is large. A tempting answer is “use the mean and a t-test; the CLT handles it.” That may be acceptable for mean differences in very large samples, but it ignores slow convergence, extreme leverage, zero inflation, and the fact that product decisions often depend on medians, percentiles, or concentration, not just means.

Communication mistake: recommending a metric without tying it to the product question. Saying “use the median because it is robust” is incomplete if leadership cares about total discussion volume or advertiser-visible engagement. A stronger answer says, “For typical user experience I’d use median/p75; for ecosystem volume I’d use mean or ratio-of-sums; for creator concentration I’d use top-share or Gini.”

Depth mistake: ignoring denominator and unit-of-analysis issues in ratios. “Comments per user” can mean total comments divided by active users, average comments among commenters, or average post-level comment rate. The better answer defines numerator, denominator, eligibility, time window, and whether the metric is user-weighted, post-weighted, or impression-weighted.
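
The gap between an exposure-weighted ratio and an equally weighted per-post rate is easy to show numerically, along with a delta-method standard error for the ratio of sums. This is a sketch under stated assumptions: the three-post toy data is invented for illustration, the function names are hypothetical, rows are treated as independent units, and dropping zero-denominator rows is just one possible inclusion rule.

```python
import numpy as np

def ratio_of_sums(num, den):
    """Exposure-weighted rate: total numerator over total denominator."""
    num, den = np.asarray(num, dtype=float), np.asarray(den, dtype=float)
    return num.sum() / den.sum()

def mean_of_ratios(num, den):
    """Equal-weight average of per-unit rates; zero-denominator units are dropped (assumed rule)."""
    num, den = np.asarray(num, dtype=float), np.asarray(den, dtype=float)
    keep = den > 0
    return float(np.mean(num[keep] / den[keep]))

def delta_method_se(num, den):
    """Delta-method SE of ratio_of_sums, treating each row as an independent unit."""
    num, den = np.asarray(num, dtype=float), np.asarray(den, dtype=float)
    n = num.size
    r = num.sum() / den.sum()
    # First-order linearization of mean(num) / mean(den) around the observed means.
    linearized = (num - r * den) / den.mean()
    return float(linearized.std(ddof=1) / np.sqrt(n))

# Toy data: three posts, one with a tiny audience (invented for illustration).
comments = np.array([0, 1, 50])
viewers = np.array([1000, 2, 5000])

print("ratio of sums: ", ratio_of_sums(comments, viewers))
print("mean of ratios:", mean_of_ratios(comments, viewers))
print("delta-method SE of ratio of sums:", delta_method_se(comments, viewers))
```

The post with two viewers barely moves the ratio of sums but dominates the mean of ratios, which is the small-denominator blow-up in miniature. In an experiment, the rows passed to delta_method_se would need to be aggregates at the randomized unit for the independence assumption to be reasonable.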

Connections

Interviewers often pivot from here into experimentation, especially ratio metrics, variance reduction, CUPED, cluster-robust standard errors, and bootstrap confidence intervals. They may also move toward product analytics topics such as metric design, guardrail metrics, bot/spam filtering, or ecosystem health metrics like creator concentration and inequality. If the discussion becomes more statistical, expect follow-ups on Poisson versus negative binomial modeling, zero-inflated models, multiple testing across percentiles, or nonparametric tests such as Mann-Whitney U.