On-Call Incident: Yahoo Mail DAU Down 10% on 2025-09-01
Assume all times are UTC, the product is global, and DAU is the canonical daily active user metric for Yahoo Mail. The 7-day baseline is 2025-08-25..2025-08-31. 2025-09-01 is a Monday and a US holiday (Labor Day), which may or may not be material.
Task
Outline an end-to-end plan to address the incident:
-
Verify the drop
-
Clarify DAU metric definition and identity, ensure bot/employee filters, and confirm experiment/treatment exclusions.
-
Check log/ETL health and data freshness.
-
Split by platform and app version (web, iOS, Android; version buckets).
-
Isolate internal vs external causes
-
Use a metric tree: DAU = new + returning; returning = retained + resurrected.
-
Segment by: geo, device, app version, notification eligibility, spam-folder rate, inbox-load latency, login failures, SMTP delivery latency, outages, release rollouts, paywall/ads.
-
Run quick SQL/Python checks
-
Find the top three contributing segments by absolute DAU loss vs the 7-day baseline.
-
Estimate each segment’s contribution to the total loss.
-
Propose a causal validation
-
Design a causal test or quasi-experiment (e.g., rollback A/B, holdout by app version, difference-in-differences with synthetic controls using unaffected geos) to validate the suspected cause.
-
Recommend mitigations and monitoring
-
Immediate mitigations.
-
Monitoring plan with alert thresholds, guardrail metrics (open rate, send success, latency), and a rollback criterion.
-
Be specific about data to pull, success metrics, and the order of operations within the first 2 hours.