Production Incident: 10% Drop in Daily Likes (DAU Flat) on 2025-09-01
You are investigating a 10% day-over-day drop in daily Like actions on a global social network on 2025-09-01, while DAU remained flat versus the prior 14-day average. Treat this as a production incident.
Provide an end-to-end diagnostic plan that includes:
-
Data validation checks
-
Enumerate concrete checks (e.g., event volume parity vs app_sessions, null-rate and schema-change diffs, client→server receipt ratios, deduplication rates, timestamp skew, ingestion delay, endpoint error rates).
-
Define specific thresholds that strongly indicate instrumentation failure vs normal noise.
-
Expected variability and exogenous factors
-
How you will rule out seasonality and expected variability (day-of-week, month-start effects), regional holidays, release calendars, and major outages/events.
-
Specify what external data you would join and how.
-
Localization
-
Localize the drop by funnel stage: feed impressions → post views → like impressions → like actions. Compute stage-to-stage conversion rates.
-
Localize by segments: surface (friends vs pages vs events), platform (iOS/Android/Web), app version, geography, new vs retained users, time-of-day.
-
Hypotheses and tests
-
State three falsifiable hypotheses for the drop.
-
For each, list the exact cuts/queries/plots you will run to test them.
-
Decision tree to root cause (≤24 hours)
-
A concrete decision tree that moves from validation → external factors → localization → root cause.
-
Include what you’d do if multiple segments move in opposite directions (mix-shift/Simpson’s paradox).
-
Mitigation and prevention
-
One immediate mitigation (safe to deploy quickly).
-
One follow-up experiment/process to prevent recurrence.
Finally, list the first three queries or charts you would run and the precise outcomes that would make you conclude that behavior change (not logging) is at fault.