Skewed Distributions And Count Data

What's being tested

Meta interviewers use skewed count-data prompts to test whether a Data Scientist can reason beyond averages when product behavior is heterogeneous: comments per post, views per video, transfers per user, shares per viewer, or recommendation overlap. The core skill is recognizing right-skewed, heavy-tailed, and zero-inflated distributions, then choosing summaries, uncertainty estimates, and comparisons that answer the product question without being dominated by outliers. Meta cares because many product metrics are counts with power-user effects: a small number of creators, posts, or users can drive most activity while the typical user sees little change. The interviewer is probing whether you can explain distribution shape, sampling variability, causal caveats, and robust metric choices clearly enough to guide a product or ranking decision.

Core knowledge

Count data are nonnegative integers: views, comments, shares, transfers, friend requests, messages. They are often not symmetric or Gaussian. Start by asking the unit of analysis: per user, per session, per content item, per viewer-content impression, or per day.
Right skew means most observations are small, with a long upper tail. For metrics like video_views_per_video, typically $\text{mode} \le \text{median} \le \text{mean}$ because viral items pull the mean upward. Report all three when describing the distribution.
Heavy tails make the mean unstable. A few videos with 100M views can dominate the average, so add quantiles like p50, p75, p90, p95, and p99. For product interpretation, p50 describes the typical item; p99 describes extreme winners or possible abuse.
Zero inflation is common when many users take no action: zero comments, zero transfers, zero shares. Separate the process into two parts: $P(Y=0)$ and $E[Y \mid Y>0]$. This distinguishes “more users participate” from “active users do more.”
Empirical quantiles are usually better than parametric assumptions for interview analysis. Sort values and take ranks, e.g. p90 is near index $\lceil 0.90n \rceil$ of the sorted values. For very large datasets, approximate algorithms like t-digest or KLL sketches can estimate quantiles, but as a DS you mainly need to know the approximation tradeoff.
Sampling distributions describe how an estimate varies across repeated samples. Even if raw comment counts are skewed, the sample mean can be approximately normal by the Central Limit Theorem when $n$ is large and variance is finite: $SE(\bar X)=\frac{s}{\sqrt n}$. Heavy tails slow convergence, so validate with bootstrap or robust summaries.
Bootstrap confidence intervals are useful for skewed distributions because they do not require normal raw data. Resample units, recompute the statistic, and take percentile intervals. Resample at the correct independence level: user-level for user metrics, video-level for video metrics, creator-level if creator clustering matters.
Binomial probability applies when each trial has a yes/no outcome, such as whether a viewer shares a video. If $X \sim \text{Binomial}(n,p)$, then $P(X=k)=\binom{n}{k} p^k(1-p)^{n-k}, \quad E[X]=np, \quad \mathrm{Var}(X)=np(1-p)$. Be careful: independence may fail if shares cluster by user, creator, or trending event.
Poisson models assume counts have equal mean and variance: $Y \sim \text{Poisson}(\lambda)$. Product count data often show overdispersion, where $\mathrm{Var}(Y)>E[Y]$, making negative binomial models more realistic for comments, views, or transfers.
Log transforms help visualization, not always inference. Plot log1p(count), that is $\log(1+x)$, for histograms or regression features because it handles zeros. But do not casually say “average log views increased” unless you can translate it back to product meaning.
Robust comparisons over time should include distributional movement, not just mean movement. Compare p50, p90, p99, share of zeros, active-user rate, and tail contribution such as “top 1% of videos account for X% of views.” This reveals whether growth is broad-based or concentrated.
Recommendation overlap can be measured with set metrics such as Jaccard similarity, $J(A,B)=\frac{|A \cap B|}{|A \cup B|}$, or overlap@K, $\frac{|A_K \cap B_K|}{K}$, where $A_K$ and $B_K$ are the top-$K$ items of each list. In ranking contexts, also consider position-sensitive metrics like NDCG, because overlap in top slots matters more than overlap at rank 100.
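
The two set-overlap formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production metric: the function names and the item IDs are hypothetical, each list is assumed to contain no duplicate items, and returning 1.0 for two empty lists is a convention chosen here.

```python
def jaccard(items_a, items_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of item IDs."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty recommendation lists count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


def overlap_at_k(ranked_a, ranked_b, k):
    """Share of the top-k slots that the two ranked lists have in common.

    Divides by k as in the formula above, so lists shorter than k
    are penalized rather than rescaled.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k


# Hypothetical ranked outputs from two recommender variants.
ranker_a = ["v1", "v2", "v3", "v4"]
ranker_b = ["v3", "v4", "v5", "v6"]
print(jaccard(ranker_a, ranker_b))
print(overlap_at_k(ranker_a, ranker_b, k=2))
print(overlap_at_k(ranker_a, ranker_b, k=4))
```

Neither function looks at position within the top K, which is why a position-sensitive metric is still worth reporting alongside them.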

Worked example

For the prompt “Analyze skewed comments and sampling effects,” a strong candidate would first clarify the unit: “Are we analyzing comments per post, per user, or per session, and over what time window?” They would also ask whether deleted comments, spam-filtered comments, or bot-like activity are included, not to design the logging pipeline, but to know what population the metric represents. The answer should be organized around four pillars: describe the raw distribution, choose robust summaries, reason about sampling uncertainty, and explain implications for decision-making.

The candidate might say that comments per post are likely right-skewed: most posts receive zero or a few comments, while a small number of viral posts receive thousands. They would report mode, median, mean, p90, p99, and fraction of zeros rather than only the average. For sampling, they would explain that the sample mean has standard error $s/\sqrt n$, but because the underlying data are skewed, they would prefer bootstrap intervals for medians, quantiles, or tail-share metrics. A specific tradeoff to flag is interpretability versus sensitivity: the mean is useful for total comment volume and capacity planning, while the median or zero rate is better for the typical creator experience. They would close by saying that, with more time, they would segment by creator size, content type, surface, and geography to see whether the skew reflects healthy virality, ranking concentration, or spam.
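
The summary the candidate describes could be computed roughly as follows. This is a sketch under stated assumptions: the input is one comment count per post, the function names and the toy data are made up for illustration, quantiles use the nearest-rank rule (value at sorted index $\lceil qn \rceil$), and the variance-to-mean ratio is included only as a quick overdispersion check against the Poisson benchmark of 1.

```python
import math
from statistics import mean, variance


def nearest_rank_quantile(sorted_values, q):
    """Value at 1-based rank ceil(q * n) of an already sorted list."""
    n = len(sorted_values)
    # round() guards against floating-point noise in q * n before taking ceil.
    rank = max(1, math.ceil(round(q * n, 9)))
    return sorted_values[rank - 1]


def summarize_counts(counts, top_fraction=0.01):
    """Robust summary of nonnegative counts, one value per unit of analysis."""
    if not counts:
        raise ValueError("counts must be non-empty")
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    positives = [v for v in values if v > 0]
    top_k = max(1, math.ceil(round(top_fraction * n, 9)))
    avg = mean(values)
    return {
        "n": n,
        "mean": avg,
        "p50": nearest_rank_quantile(values, 0.50),
        "p90": nearest_rank_quantile(values, 0.90),
        "p99": nearest_rank_quantile(values, 0.99),
        # Two-part view: who participates at all, and how much actives do.
        "zero_rate": (n - len(positives)) / n,
        "mean_given_positive": mean(positives) if positives else 0.0,
        # Tail contribution: share of the total held by the top units.
        "top_share": sum(values[-top_k:]) / total if total else 0.0,
        # Variance / mean; values far above 1 suggest overdispersion.
        "dispersion_index": variance(values) / avg if n > 1 and avg > 0 else None,
    }


# Toy data (made up): ten posts, one of them viral.
comments_per_post = [0, 0, 0, 0, 0, 1, 1, 2, 3, 93]
for name, value in summarize_counts(comments_per_post, top_fraction=0.1).items():
    print(name, value)
```

On the toy data the mean is 10 while the median is 0, and a single post accounts for 93% of all comments, which is the kind of gap the mean alone would hide.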

A second angle

For the prompt “Analyze Video View Distribution and Sharing Probability,” the same distributional reasoning applies, but the prompt adds a probability layer. Views per video are likely heavy-tailed, so the candidate should summarize them with quantiles and tail contribution, not just average views. Sharing probability is a binary outcome at the viewer-video impression level, so binomial reasoning can estimate expected shares and uncertainty, but independence may be violated because users and videos are clustered. The candidate should separate correlation from causation: videos with more views may have higher share rates because the recommender selected engaging videos, not because views themselves caused sharing. A strong answer would propose stratification, regression adjustment, or an experiment if the business question is causal.
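
The binomial layer of this answer can be made concrete with a short sketch. The impression count and share probability below are made-up illustration values, the function names are hypothetical, and the calculation assumes independent impressions with a common share probability, which is exactly the assumption the clustering caveat says may fail.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_summary(n, p):
    """Expected count, variance, and standard deviation of X ~ Binomial(n, p)."""
    expected = n * p
    var = n * p * (1 - p)
    return expected, var, math.sqrt(var)


# Made-up inputs: 1,000 impressions, each shared with probability 0.02.
n_impressions, p_share = 1000, 0.02
expected, var, sd = binomial_summary(n_impressions, p_share)
print(f"expected shares: {expected:.1f}, variance: {var:.1f}, sd: {sd:.2f}")

# Probability of seeing no shares at all in 100 impressions.
print(f"P(no shares in 100 impressions): {binomial_pmf(0, 100, p_share):.3f}")
```

If shares cluster by user or video, the true variance is typically larger than $np(1-p)$, so an interval built from this formula would be too narrow.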

Common pitfalls

Pitfall: Treating the mean as “the typical value.”

For a heavy-tailed metric like views_per_video, saying “the average video gets 10,000 views” can be deeply misleading if the median is 120. A better answer is: “The mean describes aggregate volume, while the median describes the typical video; I’d report both and inspect the upper tail.”

Pitfall: Applying the Central Limit Theorem without caveats.

It is tempting to say, “The sample size is large, so everything is normal.” The stronger version is: “The sample mean may be approximately normal if observations are independent and variance is not dominated by extreme tails, but for quantiles or highly skewed metrics I’d use bootstrap or empirical intervals.”
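
A percentile bootstrap of the kind mentioned here can be sketched with the standard library alone. This is a minimal illustration under assumptions: `bootstrap_ci` is a hypothetical helper, the input holds one value per independent unit (for example one comment count per post), the toy data are made up, and the interval endpoints are read off the sorted bootstrap estimates at approximate percentile positions.

```python
import random
from statistics import median


def bootstrap_ci(values, statistic=median, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic`.

    `values` should contain one entry per independent unit, so that
    resampling entries matches resampling units.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = sorted(
        statistic(rng.choices(values, k=n))  # resample n units with replacement
        for _ in range(n_boot)
    )
    lower_idx = int((alpha / 2) * n_boot)
    upper_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return estimates[lower_idx], estimates[upper_idx]


# Toy data (made up): mostly zeros and small counts, plus a few viral posts.
comments_per_post = [0] * 50 + [1] * 30 + [5] * 15 + [1000] * 5
low, high = bootstrap_ci(comments_per_post)
print(f"sample median: {median(comments_per_post)}, 95% bootstrap interval: ({low}, {high})")
```

The same helper accepts any statistic, such as a p90 function or a tail-share function, which is the main reason to prefer it over a normal-theory formula for skewed metrics.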

Pitfall: Jumping to a causal story from a distribution shift.

If p99 transfer counts rise, that might mean stronger retention, a product change that helps power users, fraud, seasonality, or measurement changes. A better communication pattern is to list plausible mechanisms, identify observable signatures for each, then propose segmentation or experimental evidence to distinguish them.

Connections

Interviewers often pivot from skewed count data into experiment metric design, especially whether to use mean counts, active-user rates, winsorized means, or quantile metrics as primary outcomes. They may also pivot into causal inference, ranking evaluation, or anomaly diagnosis, such as deciding whether a spike in p99 views reflects recommender concentration, creator virality, or abuse.
