Skewed Distributions And Count Data
Asked of: Data Scientist

What's being tested
Meta interviewers use skewed count-data prompts to test whether a Data Scientist can reason beyond averages when product behavior is heterogeneous: comments per post, views per video, transfers per user, shares per viewer, or recommendation overlap. The core skill is recognizing right-skewed, heavy-tailed, and zero-inflated distributions, then choosing summaries, uncertainty estimates, and comparisons that answer the product question without being dominated by outliers. Meta cares because many product metrics are counts with power-user effects: a small number of creators, posts, or users can drive most activity while the typical user sees little change. The interviewer is probing whether you can explain distribution shape, sampling variability, causal caveats, and robust metric choices clearly enough to guide a product or ranking decision.
Core knowledge
-
Count data are nonnegative integers: views, comments, shares, transfers, friend requests, messages. They are often not symmetric or Gaussian. Start by asking the unit of analysis: per user, per session, per content item, per viewer-content impression, or per day.
-
Right skew means most observations are small, with a long upper tail. For metrics like video_views_per_video, typically $\text{mode} \le \text{median} \le \text{mean}$ because viral items pull the mean upward. Report all three when describing the distribution.
-
Heavy tails make the mean unstable. A few videos with 100M views can dominate the average, so add quantiles like p50, p75, p90, p95, and p99. For product interpretation, p50 describes the typical item; p99 describes extreme winners or possible abuse.
-
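Zero inflation is common when many users take no action: zero comments, zero transfers, zero shares. Separate the process into two parts, the participation rate $P(X > 0)$ and the intensity among active users $E[X \mid X > 0]$, so that $E[X] = P(X > 0) \cdot E[X \mid X > 0]$. This distinguishes “more users participate” from “active users do more.”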
-
Empirical quantiles are usually better than parametric assumptions for interview analysis. Sort values and take ranks, e.g. p90 is near index $\lceil 0.9n \rceil$ of the $n$ sorted values (counting from 1). For very large datasets, approximate algorithms like t-digest or KLL sketches can estimate quantiles, but as a DS you mainly need to know the tradeoff: a small approximation error in exchange for much lower memory use.
-
Sampling distributions describe how an estimate varies across repeated samples. Even if raw comment counts are skewed, the sample mean can be approximately normal by the Central Limit Theorem when $n$ is large and variance is finite: $\bar{X} \approx \mathcal{N}(\mu, \sigma^2 / n)$. Heavy tails slow convergence, so validate with bootstrap or robust summaries.
-
Bootstrap confidence intervals are useful for skewed distributions because they do not require normal raw data. Resample units, recompute the statistic, and take percentile intervals. Resample at the correct independence level: user-level for user metrics, video-level for video metrics, creator-level if creator clustering matters.
-
Binomial probability applies when each trial has a yes/no outcome, such as whether a viewer shares a video. If $X \sim \text{Binomial}(n, p)$, then $E[X] = np$ and $\text{Var}(X) = np(1 - p)$. Be careful: independence may fail if shares cluster by user, creator, or trending event.
-
Poisson models assume counts have equal mean and variance: $E[X] = \text{Var}(X) = \lambda$. Product count data often show overdispersion, where $\text{Var}(X) > E[X]$, making negative binomial models more realistic for comments, views, or transfers.
-
Log transforms help visualization, not always inference. Plot log1p(count) for histograms or regression features because it handles zeros: $\text{log1p}(x) = \log(1 + x)$, so $\text{log1p}(0) = 0$. But do not casually say “average log views increased” unless you can translate it back to product meaning.
-
Robust comparisons over time should include distributional movement, not just mean movement. Compare p50, p90, p99, share of zeros, active-user rate, and tail contribution such as “top 1% of videos account for X% of views.” This reveals whether growth is broad-based or concentrated.
-
Recommendation overlap can be measured with set metrics such as Jaccard similarity: $J(A, B) = |A \cap B| / |A \cup B|$, or overlap@K: $|A_K \cap B_K| / K$, where $A_K$ and $B_K$ are the top-$K$ items of each list. In ranking contexts, also consider position-sensitive metrics like NDCG, because overlap in top slots matters more than overlap at rank 100.
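Three of the checks in the list above (the two-part split for zero-inflated counts, a quick overdispersion check, and the overlap metrics) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function names, the sample counts, and the video ids are hypothetical, and returning 1.0 for the Jaccard similarity of two empty lists is a convention chosen here.

```python
from statistics import mean, variance


def hurdle_parts(counts):
    """Return (share of units with count > 0, mean count among those units)."""
    active = [c for c in counts if c > 0]
    p_active = len(active) / len(counts)
    mean_active = mean(active) if active else 0.0
    return p_active, mean_active


def dispersion_ratio(counts):
    """Sample variance over mean: near 1 for Poisson-like data, above 1 if overdispersed."""
    return variance(counts) / mean(counts)


def jaccard(a, b):
    """Size of the intersection over size of the union of two item collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def overlap_at_k(ranked_a, ranked_b, k):
    """Share of the top-k slots whose items appear in both top-k lists."""
    return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k


# Hypothetical comments-per-user sample: mostly zeros plus one heavy commenter.
comments = [0, 0, 0, 0, 0, 0, 1, 1, 2, 16]
p_active, mean_active = hurdle_parts(comments)
print(f"participation={p_active:.2f}, mean among active={mean_active:.2f}")
print(f"overall mean={p_active * mean_active:.2f}")
print(f"variance/mean={dispersion_ratio(comments):.2f}")

# Hypothetical ranked recommendation lists from two model versions.
ranked_a = ["v1", "v2", "v3", "v4"]
ranked_b = ["v2", "v5", "v1", "v6"]
print(f"jaccard={jaccard(ranked_a, ranked_b):.3f}")
print(f"overlap@2={overlap_at_k(ranked_a, ranked_b, 2):.2f}")
```

A variance-to-mean ratio far above 1 is only a screening signal that a Poisson model is too tight; it does not by itself confirm that a negative binomial model fits.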
Worked example
For “Analyze skewed comments and sampling effects,” a strong candidate would first clarify the unit: “Are we analyzing comments per post, per user, or per session, and over what time window?” They would also ask whether deleted comments, spam-filtered comments, or bot-like activity are included, not to design the logging pipeline, but to know what population the metric represents. The answer should be organized around four pillars: describe the raw distribution, choose robust summaries, reason about sampling uncertainty, and explain implications for decision-making.
The candidate might say that comments per post are likely right-skewed: most posts receive zero or a few comments, while a small number of viral posts receive thousands. They would report mode, median, mean, p90, p99, and fraction of zeros rather than only the average. For sampling, they would explain that the sample mean has standard error $s / \sqrt{n}$, where $s$ is the sample standard deviation and $n$ the number of sampled posts, but because the underlying data are skewed, they would prefer bootstrap intervals for medians, quantiles, or tail-share metrics. A specific tradeoff to flag is interpretability versus sensitivity: the mean is useful for total comment volume and capacity planning, while the median or zero rate is better for the typical creator experience. They would close by saying that, with more time, they would segment by creator size, content type, surface, and geography to see whether the skew reflects healthy virality, ranking concentration, or spam.
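The bootstrap step in this answer can be sketched as follows. It is a minimal percentile-bootstrap illustration, assuming posts are independent units; the synthetic lognormal sample, the function name bootstrap_ci, and the defaults (1,000 resamples, 95% level) are assumptions made here, not part of the prompt.

```python
import random
from statistics import mean, median


def bootstrap_ci(values, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat, resampling whole units with replacement."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = int(alpha / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return estimates[lo_idx], estimates[hi_idx]


# Synthetic comments per post: truncated lognormal draws give zeros and a long upper tail.
rng = random.Random(42)
comments_per_post = [int(rng.lognormvariate(1, 1.5)) for _ in range(500)]

lo_med, hi_med = bootstrap_ci(comments_per_post, median)
lo_mean, hi_mean = bootstrap_ci(comments_per_post, mean)
print(f"median={median(comments_per_post)}, 95% CI=({lo_med}, {hi_med})")
print(f"mean={mean(comments_per_post):.2f}, 95% CI=({lo_mean:.2f}, {hi_mean:.2f})")
```

If comments cluster by creator, the same function could be applied to a list of per-creator aggregates instead of per-post counts, so that the resampled unit matches the level at which observations are independent.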
A second angle
For “Analyze Video View Distribution and Sharing Probability,” the same distributional reasoning applies, but the prompt adds a probability layer. Views per video are likely heavy-tailed, so the candidate should summarize them with quantiles and tail contribution, not just average views. Sharing probability is a binary outcome at the viewer-video impression level, so binomial reasoning can estimate expected shares and uncertainty, but independence may be violated because users and videos are clustered. The candidate should separate correlation from causation: videos with more views may have higher share rates because the recommender selected engaging videos, not because views themselves caused sharing. A strong answer would propose stratification, regression adjustment, or an experiment if the business question is causal.
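The binomial layer of this prompt can be made concrete with a short sketch. It assumes independent impressions with a common share probability, which is exactly the assumption the paragraph above warns may fail; the numbers (50 impressions, 2% share probability) and function names are hypothetical.

```python
from math import comb, sqrt


def binomial_summary(n, p):
    """Expected shares, standard deviation, and P(at least one share) for n independent impressions."""
    expected = n * p
    sd = sqrt(n * p * (1 - p))
    p_at_least_one = 1 - (1 - p) ** n
    return expected, sd, p_at_least_one


def binomial_pmf(k, n, p):
    """P(exactly k shares) under Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical numbers: 50 impressions of one video, 2% share probability each.
n_impressions, p_share = 50, 0.02
expected, sd, p_any = binomial_summary(n_impressions, p_share)
print(f"expected shares={expected:.2f}, sd={sd:.2f}, P(>=1 share)={p_any:.3f}")
print(f"P(exactly 2 shares)={binomial_pmf(2, n_impressions, p_share):.3f}")
```

When shares are positively correlated within a user or a video, the true variance is typically larger than $np(1 - p)$, so the standard deviation printed here is better read as optimistic than as exact.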
Common pitfalls
Pitfall: Treating the mean as “the typical value.”
For a heavy-tailed metric like views_per_video, saying “the average video gets 10,000 views” can be deeply misleading if the median is 120. A better answer is: “The mean describes aggregate volume, while the median describes the typical video; I’d report both and inspect the upper tail.”
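A small sketch shows how far the mean and median can diverge and what a fuller summary looks like. The data are invented to mimic the situation described (one viral video among otherwise ordinary ones), the nearest-rank percentile rule is one of several common definitions, and the helper names are hypothetical.

```python
from statistics import mean, median


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n), counting from 1."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


def describe_counts(values, top_share=0.01):
    """Mean, median, upper quantiles, zero share, and share of the total held by the top units."""
    ordered = sorted(values, reverse=True)
    top_n = max(1, int(len(ordered) * top_share))
    total = sum(ordered)
    return {
        "mean": mean(values),
        "p50": median(values),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
        "zero_share": sum(v == 0 for v in values) / len(values),
        "top_share_of_total": sum(ordered[:top_n]) / total if total else 0.0,
    }


# Invented views_per_video sample: 10 unseen videos, 89 ordinary ones, 1 viral hit.
views = [0] * 10 + [120] * 89 + [1_000_000]
summary = describe_counts(views)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this sample the mean is above 10,000 while p50, p90, and p99 are all 120, and the top 1% of videos holds nearly all views, which is why the tail share is worth reporting alongside the quantiles.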
Pitfall: Applying the Central Limit Theorem without caveats.
It is tempting to say, “The sample size is large, so everything is normal.” The stronger version is: “The sample mean may be approximately normal if observations are independent and variance is not dominated by extreme tails, but for quantiles or highly skewed metrics I’d use bootstrap or empirical intervals.”
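One way to see why the caveat matters is to simulate how often a nominal 95% normal-approximation interval for the mean actually covers the true mean when the data are skewed. The sketch below assumes lognormal data with a shape parameter of 1.5 (chosen here only for illustration) and uses a hypothetical helper, normal_ci_coverage.

```python
import random
from math import exp, sqrt


def normal_ci_coverage(n, reps=1000, sigma=1.5, seed=0):
    """Share of nominal 95% normal-approximation intervals that contain the true mean."""
    rng = random.Random(seed)
    true_mean = exp(sigma**2 / 2)  # mean of a lognormal with mu=0
    hits = 0
    for _ in range(reps):
        sample = [rng.lognormvariate(0, sigma) for _ in range(n)]
        m = sum(sample) / n
        s = sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
        half_width = 1.96 * s / sqrt(n)
        hits += (m - half_width) <= true_mean <= (m + half_width)
    return hits / reps


coverage_small = normal_ci_coverage(n=30)
coverage_large = normal_ci_coverage(n=500)
print(f"n=30:  coverage={coverage_small:.3f}")
print(f"n=500: coverage={coverage_large:.3f}")
```

Under these assumptions the small-sample interval should cover noticeably less often than 95%, with coverage improving as n grows; heavier tails than this lognormal would need even larger samples.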
Pitfall: Jumping to a causal story from a distribution shift.
If p99 transfer counts rise, that might mean stronger retention, a product change that helps power users, fraud, seasonality, or measurement changes. A better communication pattern is to list plausible mechanisms, identify observable signatures for each, then propose segmentation or experimental evidence to distinguish them.
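Segmentation is often the cheapest first step for telling these mechanisms apart. The sketch below recomputes p99 by segment for two periods; the transfer counts and the "new" versus "established" account-age segments are invented for illustration, and the nearest-rank percentile is an assumed definition.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n), counting from 1."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


def p99_by_segment(rows):
    """rows: iterable of (segment, count) pairs. Returns {segment: p99 of counts}."""
    groups = defaultdict(list)
    for segment, count in rows:
        groups[segment].append(count)
    return {seg: percentile(vals, 99) for seg, vals in groups.items()}


# Invented transfer counts per user for two periods, tagged by account-age segment.
established = [0, 1, 2, 3, 4, 5, 6, 8, 10, 12]
before = [("new", c) for c in [0, 0, 1, 1, 2, 2, 3, 3, 4, 5]]
before += [("established", c) for c in established]
after = [("new", c) for c in [0, 0, 1, 1, 2, 2, 3, 3, 4, 40]]
after += [("established", c) for c in established]

for label, rows in (("before", before), ("after", after)):
    overall = percentile([count for _, count in rows], 99)
    print(label, "overall p99:", overall, "by segment:", p99_by_segment(rows))
```

Here the overall p99 rises from 12 to 40, but the segment view shows the entire shift comes from new accounts, a pattern that would point toward fraud or an onboarding change rather than broad retention gains and would need further evidence to confirm.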
Connections
Interviewers often pivot from skewed count data into experiment metric design, especially whether to use mean counts, active-user rates, winsorized means, or quantile metrics as primary outcomes. They may also pivot into causal inference, ranking evaluation, or anomaly diagnosis, such as deciding whether a spike in p99 views reflects recommender concentration, creator virality, or abuse.
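As a concrete reference point for the metric-design pivot, a winsorized mean caps extreme values before averaging. The sketch below caps at the nearest-rank 99th percentile of the same sample; the cap level, the function name, and the data are assumptions for illustration.

```python
from statistics import mean


def winsorized_mean(values, upper_pct=99):
    """Mean after capping every value at the nearest-rank upper_pct percentile."""
    ordered = sorted(values)
    rank = max(1, -(-upper_pct * len(ordered) // 100))  # integer ceiling division
    cap = ordered[rank - 1]
    return mean(min(v, cap) for v in values)


# Invented per-user counts: 99 ordinary users plus one extreme account.
counts = list(range(1, 100)) + [1_000_000]
print(f"raw mean={mean(counts):.2f}")
print(f"winsorized mean (cap at p99)={winsorized_mean(counts):.2f}")
```

The raw mean is 10049.50 and the winsorized mean is 50.49, so the capped metric is far less sensitive to a single account. In an experiment, the cap would usually be fixed from pooled or pre-period data so that both arms are capped at the same value.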
Further reading
-
An Introduction to Statistical Learning — clear treatment of model evaluation, transformations, and practical statistical reasoning.
-
Wasserman, All of Statistics — concise reference for sampling distributions, standard errors, bootstrap, and confidence intervals.
-
Hilbe, Negative Binomial Regression — deeper treatment of overdispersed count data beyond the Poisson model.
Practice questions
- Model comment count distribution and validate assumptions (Meta · Data Scientist · Onsite · Medium)
- Characterize and compare transfer-count distributions over time (Meta · Data Scientist · Onsite · Medium)
- Characterize metric distribution and quantiles (Meta · Data Scientist · Onsite · Medium)
- Analyze skewed comments and sampling effects (Meta · Data Scientist · Onsite · Medium)
- Analyze Video View Distribution and Sharing Probability (Meta · Data Scientist · Onsite · Medium)
- Analyze User-Comment Distribution to Understand Engagement (Meta · Data Scientist · Onsite · Medium)
- Analyze View Distribution and Recommendation Overlap in Videos (Meta · Data Scientist · Onsite · Medium)
Related concepts
- Skewed Distributions, Count Data, And Ratio Metrics
- Central Limit Theorem, Sampling, And Heavy-Tailed Metrics
- Skewed And Heavy-Tailed Distributions
- Heavy-Tailed and Zero-Inflated Distribution Analysis
- Distributional Analysis And Robust Statistics
- Statistical Inference, Power, And Metric Uncertainty