This question evaluates a data scientist's skills in root-cause analysis, metric instrumentation verification, segment-wise localization, attribution of internal versus external causes, and time-series causal-impact and anomaly-detection methods.

A core KPI (comments_per_DAU) suddenly drops materially. Outline a structured root-cause analysis and validation plan.
a) Scoping and sanity: Quantify the drop (absolute and relative), confirm it across dashboards and raw logs, and rule out instrumentation artifacts (recent releases, schema changes, event loss, clock skew). Which guardrail metrics do you check first, and why?
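
A minimal sanity-check sketch for this step, assuming hypothetical warehouse exports (`raw_comment_events.parquet`, `dau.parquet`, `dashboard_kpi.csv`; all file and column names are illustrative): recompute the KPI from raw logs, diff it against the dashboard, and scan hourly event volume for ingestion gaps.

```python
import pandas as pd

# Hypothetical warehouse exports; all file and column names are illustrative.
events = pd.read_parquet("raw_comment_events.parquet")   # columns: ts, user_id
dau = pd.read_parquet("dau.parquet")                     # columns: date, dau

# 1) Recompute comments_per_DAU from raw logs and diff it against the dashboard.
events["date"] = pd.to_datetime(events["ts"]).dt.normalize()
daily = events.groupby("date").size().rename("comments").to_frame()
daily = daily.join(dau.assign(date=pd.to_datetime(dau["date"])).set_index("date"))
kpi_raw = daily["comments"] / daily["dau"]

dashboard = pd.read_csv("dashboard_kpi.csv", parse_dates=["date"], index_col="date").squeeze()
drift = (kpi_raw - dashboard).abs() / dashboard
print(drift[drift > 0.02])   # >2% disagreement points at a pipeline/logging issue

# 2) Scan for ingestion gaps: hourly event volume vs. the same hour one week back.
hourly = events.set_index(pd.to_datetime(events["ts"])).resample("1h").size()
wow = hourly / hourly.shift(24 * 7)
print(wow[wow < 0.5])        # hours at <50% of last week's volume suggest event loss
```
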
b) Slice-and-localize: Describe the exact segmentation you would run (platform/app version, geo, cohort, time of day, traffic source, user tenure, content category, creators vs. consumers) and the funnel steps to inspect. How do you separate mix shift from within-segment degradation?
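
The standard separation is an additive decomposition: with segment DAU share w_s and segment rate m_s, the KPI change splits into a within-segment term (pre-period weights times rate changes), a mix-shift term (weight changes times pre-period rates), and an interaction term. A minimal pandas sketch, assuming hypothetical per-segment aggregates with `dau` and `comments` columns:

```python
import pandas as pd

def decompose(pre: pd.DataFrame, post: pd.DataFrame, seg: str) -> pd.DataFrame:
    """Split the KPI change into within-segment, mix-shift, and interaction parts.

    pre/post: one row per segment with columns `dau` and `comments`
    aggregated over the comparison windows (hypothetical schema).
    """
    df = pre.set_index(seg).join(post.set_index(seg), lsuffix="_pre", rsuffix="_post")
    for p in ("pre", "post"):
        df[f"w_{p}"] = df[f"dau_{p}"] / df[f"dau_{p}"].sum()   # DAU share
        df[f"m_{p}"] = df[f"comments_{p}"] / df[f"dau_{p}"]    # segment rate
    dw = df["w_post"] - df["w_pre"]
    dm = df["m_post"] - df["m_pre"]
    return pd.DataFrame({
        "within": df["w_pre"] * dm,    # the rate moved inside the segment
        "mix": dw * df["m_pre"],       # traffic moved between segments
        "interaction": dw * dm,
    })

# parts = decompose(pre_week, drop_week, seg="platform")
# parts.to_numpy().sum() recovers the total KPI change; a dominant "mix" column
# means composition, not behavior, explains the drop.
```
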
c) Attribution: Enumerate potential internal causes (code deploys/feature flags, A/B tests, recommender changes, moderation policy changes, outages) and external causes (holidays, news events, competitor actions). For each, propose a validation: a quick reversion, killing a flag, a switchback test on suspect features, holdout comparisons, difference-in-differences against an unaffected market, placebo tests.
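
For the difference-in-differences check specifically, a hedged sketch using statsmodels, assuming a hypothetical daily panel `market_daily_kpi.csv` with columns `market`, `date`, `treated`, and `comments_per_dau` (the break date is illustrative):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical daily panel: one row per (market, date) with the KPI and a
# `treated` flag for the affected market(s).
df = pd.read_csv("market_daily_kpi.csv", parse_dates=["date"])
df["post"] = (df["date"] >= "2024-05-01").astype(int)   # illustrative break date

# The treated:post coefficient is the DiD estimate of the drop attributable to
# the suspect cause; cluster by market to respect within-market correlation.
m = smf.ols("comments_per_dau ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["market"]}
)
print(m.summary().tables[1])

# Placebo: rerun with a fake break date inside the pre-period; a "significant"
# placebo effect undermines the parallel-trends assumption.
```
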
d) Time-series methods: Specify an anomaly-detection or causal-impact approach (e.g., STL + CUSUM, BSTS/SCUL) with controls. What prior period and controls would you pick, and how do you guard against post-treatment contamination?
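
A minimal STL + CUSUM sketch (the file name, break date, and the k/h tuning are assumptions): fit a robust weekly decomposition, then calibrate the CUSUM baseline on pre-period residuals only so post-treatment data cannot contaminate it.

```python
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Hypothetical daily KPI series; file name, suspected break date, and the
# CUSUM tuning (k, h) are all illustrative.
kpi = pd.read_csv("kpi_daily.csv", parse_dates=["date"], index_col="date").squeeze()
suspect = pd.Timestamp("2024-05-01")

# robust=True downweights post-drop outliers in the weekly decomposition.
resid = STL(kpi, period=7, robust=True).fit().resid

# Calibrate mean/scale on pre-period residuals only, so post-treatment data
# cannot contaminate the baseline the detector is judged against.
pre = resid[resid.index < suspect]
z = (resid - pre.mean()) / pre.std()

# One-sided tabular CUSUM for a downward shift: k = allowance, h = alarm limit.
k, h = 0.5, 5.0
s, alarms = 0.0, []
for t, zt in z.items():
    s = min(0.0, s + zt + k)     # accumulate evidence of a downward level shift
    if s < -h:
        alarms.append(t)
print("first alarm:", alarms[0] if alarms else "none")
```
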
e) Decision and follow-up: Define thresholds for declaring a root cause versus a mere correlation, the rollback/mitigation plan, and how you’d monitor recovery and prevent recurrence (alerts, canaries, auto-rollbacks, dashboards).
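
For the monitoring piece, one simple pattern is to page only after several consecutive days outside a recovery band on the seasonally adjusted series, as computed in the sketch above; the thresholds here are illustrative.

```python
import pandas as pd

# A minimal alert rule on standardized, seasonally adjusted residuals `z`
# (as produced in the STL + CUSUM sketch). Paging only after `consecutive`
# bad days trades a little detection delay for fewer false alarms.
def should_page(z: pd.Series, threshold: float = -3.0, consecutive: int = 2) -> bool:
    recent = z.tail(consecutive)
    return len(recent) == consecutive and bool((recent < threshold).all())

# Example: two consecutive days below -3 sigma trips the alert.
z = pd.Series([0.2, -0.5, -3.4, -3.8])
if should_page(z):
    print("ALERT: comments_per_DAU outside recovery band; escalate / consider rollback")
```
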