Anomaly Detection: Time Series And Change Points For Meta Metrics

What's being tested

Interviewers are probing whether you can detect, validate, and diagnose unusual movement in Meta-scale product metrics such as `DAU`, `sessions`, `time_spent`, `ads_clicks`, `message_sends`, or ranking engagement rates. A strong Data Scientist separates true product change from noise, seasonality, instrumentation shifts, and mix effects, then turns the anomaly into a prioritized investigation plan. Meta cares because small relative changes on massive surfaces can affect millions of users, experiment decisions, revenue, or integrity outcomes. The expected answer is not “fit a model”; it is a statistically disciplined workflow for deciding whether something changed, when it changed, who it affected, and what evidence supports a causal story.

Core knowledge

Metric decomposition is the first move: split a top-line metric into rate, numerator, denominator, and cohort components. For example, `feed_likes_per_user` = `feed_likes` / `feed_viewers`, which factors further into impressions per viewer times likes per impression; a drop could come from fewer impressions per viewer, lower likes per impression, or a shift in viewer mix toward less engaged cohorts.
Expected baseline modeling should account for trend, day-of-week effects, holidays, launches, and known seasonality. A common decomposition is $y_t = T_t + S_t + H_t + \epsilon_t$ where $T_t$ is trend, $S_t$ seasonality, $H_t$ holiday/event effects, and $\epsilon_t$ residual noise.
Robust z-scores work well for quick anomaly triage when the series has no strong trend or seasonality left in it: $z_t = \frac{y_t - \text{median}(y_{t-k:t-1})}{1.4826 \cdot \text{MAD}(y_{t-k:t-1})}$, where the factor 1.4826 scales the MAD to match the standard deviation under normality. Use median absolute deviation rather than standard deviation when outliers are common, because a single past incident would otherwise inflate the scale and mask the next one.
Confidence intervals matter more than point changes. For a rate $p = x/n$, the approximate standard error is $SE(p) = \sqrt{\frac{p(1-p)}{n}}$; for relative changes, propagate uncertainty from both numerator and denominator, especially when traffic volume changes across segments.
Change point detection asks “when did the generating process change?” rather than “is this point unusual?” Useful methods include CUSUM for persistent mean shifts, PELT for offline segmentation, and Bayesian online change point detection for streaming-style alerts. Alerts on the residuals of an STL decomposition complement these for seasonal series, although they flag unusual points rather than locate a regime change.
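CUSUM is useful when small sustained drifts matter: track cumulative deviations from a target, e.g. $S_t = \max(0, S_{t-1} + y_t - \mu_0 - k)$ with $S_0 = 0$, and alert when $S_t > h$. Here $\mu_0$ is the in-control mean, $k$ is a slack that absorbs normal noise, and $h$ is the alert threshold. As written, the statistic accumulates upward deviations; to catch drops, track the mirrored sum $\max(0, S_{t-1} + \mu_0 - y_t - k)$. It is more sensitive to gradual degradation than single-point z-score rules.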
Multiple testing is unavoidable at Meta scale. If you monitor hundreds of metrics across countries, platforms, app versions, and cohorts, naive $p<0.05$ alerts create noise. Use alert severity tiers, false discovery rate control, or require persistence across time buckets.
Segmentation is diagnostic, not just descriptive. Slice by `iOS` vs `Android`, app version, country, logged-in state, traffic source, surface, experiment exposure, creator/viewer cohorts, and new vs retained users. The goal is to find the minimal segment where the anomaly is concentrated.
Counterfactual comparison strengthens interpretation. Compare affected vs unaffected cohorts, similar regions, prior same weekday, or holdout groups if available. A simple difference-in-differences frame is $\Delta = (Y_{\text{treated},\text{post}} - Y_{\text{treated},\text{pre}}) - (Y_{\text{control},\text{post}} - Y_{\text{control},\text{pre}})$.
Data quality checks are in scope as analytical validation: compare related metrics, event volume, logging version, missingness, and denominator consistency. You are not expected to design ingestion pipelines; instead, query upstream logging metadata or validation dashboards to decide whether the metric movement is trustworthy.
Alert thresholds should reflect business cost. A `0.1%` movement in `DAU` may be material; a `5%` movement in a tiny experimental surface may be noise. Good thresholds combine statistical significance, practical significance, duration, and blast radius.
Root-cause hypotheses should be falsifiable. Examples: ranking model change reduced feed impressions, notification send volume dropped, app crash rate increased, a country-specific outage reduced sessions, or an experiment ramp changed exposure mix. Each hypothesis maps to a metric split that should move predictably.
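
The robust z-score and the one-sided CUSUM recursion described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production detector: the function names `robust_z` and `cusum_upper`, the sample series, and the values chosen for `mu0`, `k`, and `h` are all assumptions made for the example.

```python
import math
from statistics import median


def robust_z(history, value):
    """Robust z-score of `value` against a trailing window `history`."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        # At least half the window equals the median, so the scale is undefined.
        return 0.0 if value == med else math.copysign(math.inf, value - med)
    return (value - med) / (1.4826 * mad)


def cusum_upper(series, mu0, k, h):
    """Indices t where S_t = max(0, S_{t-1} + y_t - mu0 - k) exceeds h."""
    s = 0.0
    alerts = []
    for t, y in enumerate(series):
        s = max(0.0, s + y - mu0 - k)
        if s > h:
            alerts.append(t)
    return alerts


# Hypothetical trailing window of a daily metric and a new observation.
window = [100, 102, 98, 101, 99, 100, 103]
z_today = robust_z(window, 80)  # about -13.5: far below the usual range

# Hypothetical series that sits near 10 and then shifts up to 11.
series = [10.0, 10.2, 9.9, 10.1, 11.0, 11.0, 11.0, 11.0]
up_alerts = cusum_upper(series, mu0=10.0, k=0.5, h=1.2)  # [6, 7]

# To monitor drops instead, negate both the series and the target.
down_alerts = cusum_upper([-y for y in series], mu0=-10.0, k=0.5, h=1.2)  # []
```

The CUSUM alert fires only on the third consecutive elevated point, once the accumulated excess over `mu0 + k` passes `h`; choosing `k` and `h` is the sensitivity versus false-positive tradeoff in miniature.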

Worked example

For “How would you investigate a sudden drop in `DAU`?”, start by clarifying the definition of `DAU`: logged-in users, unique users with any app activity, timezone used, inclusion of web, and whether the drop is absolute or relative. Then ask when the drop started, whether it is one time bucket or persistent, and whether the observed drop exceeds historical volatility for the same weekday and season.

A strong answer would organize around four pillars: validate the metric, quantify the anomaly, localize the affected population, and generate causal hypotheses. For validation, compare `DAU` against related metrics like `sessions`, `app_opens`, `feed_viewers`, login success, and event logging volume. For quantification, build an expected baseline using the prior 6–8 weeks with day-of-week controls and compute residuals or confidence bands. For localization, segment by platform, country, app version, acquisition channel, and new vs returning users to identify where the deviation is concentrated.
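
The quantification step can be sketched as a same-weekday baseline with a robust band. The code below is a simplified illustration under stated assumptions: it uses only the median of the same weekday over the prior weeks, ignores trend and holidays, and the names `weekday_baseline` and `residual_report`, the band width of 3, and the synthetic `history` are hypothetical.

```python
from datetime import date, timedelta
from statistics import median


def weekday_baseline(history, target_day, weeks=8):
    """Expected value and robust spread from the same weekday in prior weeks.

    `history` maps date -> metric value. Trend and holidays are ignored here.
    """
    prior = []
    for w in range(1, weeks + 1):
        day = target_day - timedelta(weeks=w)
        if day in history:
            prior.append(history[day])
    if len(prior) < 4:
        raise ValueError("not enough same-weekday history for a baseline")
    expected = median(prior)
    robust_sd = 1.4826 * median([abs(v - expected) for v in prior])
    return expected, robust_sd


def residual_report(history, target_day, observed, weeks=8, width=3.0):
    """Compare an observed value with the same-weekday band."""
    expected, robust_sd = weekday_baseline(history, target_day, weeks)
    lower = expected - width * robust_sd
    upper = expected + width * robust_sd
    return {
        "expected": expected,
        "residual": observed - expected,
        "relative_change": (observed - expected) / expected,
        "lower": lower,
        "upper": upper,
        "outside_band": not (lower <= observed <= upper),
    }


# Synthetic 8 weeks of daily data: day indices 5 and 6 of each week play the
# role of the weekend and get a bump; i % 3 adds a small deterministic wiggle.
start = date(2024, 1, 1)
history = {
    start + timedelta(days=i): 1000 + (50 if i % 7 >= 5 else 0) + (i % 3)
    for i in range(56)
}
target = start + timedelta(days=56)
report = residual_report(history, target, observed=950)
```

Here `report["outside_band"]` is true because 950 sits far below a same-weekday median of 1001 with a very tight historical spread. In a real series the band would be wider, and a trend term would be needed if the metric is growing or shrinking over the window.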

One explicit tradeoff to flag is sensitivity versus false positives: an aggressive threshold catches incidents early but may page teams for normal weekend or holiday variation. A DS answer should recommend severity levels, such as “investigate if the residual exceeds 3 robust standard deviations for two consecutive hours, escalate if the drop affects more than X million users or persists across daily aggregation.” Close by saying that if you had more time, you would compare against experiment ramps, launch calendars, crash metrics, and external events to distinguish product impact from logging or ecosystem effects.
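
A tiered rule of this kind can be written down explicitly. The sketch below assumes hourly residuals have already been converted to robust z-scores; the function names, the tier labels, and the `escalate_users` cutoff passed in the usage lines are placeholders for illustration, since the real cutoff is a business decision. The "persists across daily aggregation" condition is not modeled.

```python
def longest_drop_run(z_scores, threshold):
    """Length of the longest run of consecutive z-scores below -threshold."""
    longest = current = 0
    for z in z_scores:
        current = current + 1 if z < -threshold else 0
        longest = max(longest, current)
    return longest


def severity(z_scores, affected_users, escalate_users,
             z_threshold=3.0, min_consecutive=2):
    """Map hourly robust z-scores and blast radius to an alert tier."""
    if longest_drop_run(z_scores, z_threshold) < min_consecutive:
        return "ignore"  # isolated dips are treated as noise
    if affected_users > escalate_users:
        return "escalate"  # persistent and large blast radius
    return "investigate"  # persistent but contained


# Placeholder numbers: two isolated dips, then a two-hour sustained dip.
isolated = severity([-1.0, -3.5, -0.5, -3.2], affected_users=10, escalate_users=100)
sustained = severity([-1.0, -3.5, -4.1, -0.2], affected_users=10, escalate_users=100)
wide = severity([-1.0, -3.5, -4.1, -0.2], affected_users=200, escalate_users=100)
```

With these inputs the three calls return `"ignore"`, `"investigate"`, and `"escalate"` respectively. Requiring persistence trades a delay of one time bucket for far fewer pages on single-bucket noise.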

A second angle

For “How would you detect a change point in `time_spent` after a Feed ranking launch?”, the same concepts apply, but the framing shifts from open-ended incident triage to estimating whether a known intervention changed the metric trajectory. You would define pre/post windows, exclude ramp-up instability if needed, and compare exposed users to a suitable control or holdout group. Because `time_spent` is usually heavy-tailed, you might analyze winsorized means, medians, or per-user capped values rather than raw averages alone. The main constraint is confounding: if the launch coincided with seasonality, another experiment, or a traffic mix shift, a naive pre/post change point is not enough. A better answer combines time-series evidence with experiment or quasi-experimental reasoning.
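
To make the heavy-tail point concrete, the sketch below computes a difference-in-differences on per-user capped values and on raw values. It is a toy illustration: the cap of 120, the function names, and the four small samples are assumptions, and a real analysis would also need uncertainty estimates and a check that the control group tracks the treated group before the launch.

```python
from statistics import fmean


def capped_mean(values, cap):
    """Mean after capping each per-user value at `cap`."""
    return fmean(min(v, cap) for v in values)


def diff_in_diff(treated_pre, treated_post, control_pre, control_post, cap=None):
    """(treated post - treated pre) - (control post - control pre)."""
    if cap is None:
        mean = fmean
    else:
        def mean(values):
            return capped_mean(values, cap)
    treated_delta = mean(treated_post) - mean(treated_pre)
    control_delta = mean(control_post) - mean(control_pre)
    return treated_delta - control_delta


# Hypothetical per-user minutes; one treated user has an extreme post value.
treated_pre = [10, 20, 30]
treated_post = [15, 25, 500]
control_pre = [10, 20, 30]
control_post = [12, 22, 32]

raw_effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
capped_effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post, cap=120)
```

The raw estimate is 158 minutes while the capped estimate is about 31.3, so a single heavy user drives most of the raw effect. Reporting both makes it clear how much of the movement is broad-based.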

Common pitfalls

Pitfall: Treating every spike or dip as a product anomaly.

A tempting answer is “calculate a z-score and alert above 3.” That misses weekly seasonality, holidays, launch calendars, and correlated metrics. A better answer explains the expected baseline first, then evaluates residual movement and persistence.

Pitfall: Jumping to root cause before validating the metric.

Saying “the ranking model probably caused the drop” too early sounds decisive but is weak reasoning. Interviewers want to hear that you would first check denominator definitions, logging completeness, related metrics, and whether the anomaly appears in independent signals like `sessions` or `crash_rate`.

Pitfall: Staying too high-level on segmentation.

“Slice by demographics and platform” is not enough. Strong candidates say what each split would prove: if only `iOS` app version `vX.Y` drops, suspect client release or logging; if only new users drop, suspect onboarding or acquisition; if only one country drops, suspect regional outage, policy, or local event.
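
One way to show where a drop is concentrated is to compute each segment's share of the total change. The sketch below assumes mutually exclusive, additive segments (for example, one primary platform per user); the function name and the counts are made up for illustration.

```python
def segment_contributions(pre, post):
    """Per-segment change, relative change, and share of the total change.

    `pre` and `post` map segment name -> count. Rows are sorted so the
    largest drop comes first.
    """
    total_delta = sum(post.values()) - sum(pre.values())
    rows = []
    for segment in sorted(set(pre) | set(post)):
        before = pre.get(segment, 0)
        delta = post.get(segment, 0) - before
        relative = delta / before if before else None
        share = delta / total_delta if total_delta else None
        rows.append((segment, delta, relative, share))
    return sorted(rows, key=lambda row: row[1])


# Made-up counts: almost all of the drop sits in one platform.
pre = {"ios": 1000, "android": 2000, "web": 500}
post = {"ios": 800, "android": 1990, "web": 500}
rows = segment_contributions(pre, post)
top_segment, top_delta, top_relative, top_share = rows[0]
```

Here `ios` accounts for roughly 95% of the total drop and falls 20% on its own base, while `android` barely moves. That pattern points toward a client release or logging issue on one platform rather than a global cause.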

Connections

This topic often pivots into experiment analysis, especially distinguishing anomaly detection from causal inference. It also connects to metric design, guardrail metrics, sequential testing, cohort analysis, and diagnosing ranking or recommender quality changes from product telemetry.