Diagnose SLA drops and prioritize fixes
Company: Snapchat
Role: Technical Program Manager
Category: Product / Decision Making
Difficulty: medium
Interview Round: Onsite
You are a Technical Program Manager responsible for an ML platform or service.
Explain how you would perform root-cause analysis if a service's SLA suddenly drops and how you would improve reliability afterward. Also discuss how you would evaluate project ROI or cost savings, hold cross-functional teams accountable, and respond when headline metrics look healthy but leadership is still dissatisfied.
### Constraints & Assumptions
- SLA could refer to availability, latency, freshness, throughput, or model-serving correctness.
- Stabilize the service before running a full postmortem.
- Identify the triggering cause, the contributing factors, and why the safeguards failed.
- Avoid blame; focus on mechanisms, ownership, and durable fixes.
### Clarifying Questions to Ask
- Which SLA dropped and when?
- Which users, regions, models, pipelines, or downstream products are affected?
- Was there a recent deployment, config change, traffic spike, data issue, or dependency incident?
- What is the customer, revenue, or trust impact?
- Are leadership concerns tied to a metric mismatch, segment pain, or strategic expectations?
### What a Strong Answer Covers
- Incident containment, timeline, segmentation, logs, metrics, traces, and dependency checks.
- Root-cause categories such as release regression, capacity, bad data, feature-store lag, dependency outage, model version, or abnormal traffic.
- Reliability fixes prioritized by impact, effort, risk reduction, and time-to-value.
- SLOs, error budgets, runbooks, canaries, auto-rollback, ownership, and postmortem action tracking.
- An explicit ROI formula, such as (benefit - cost) / cost, and a cost-savings model with its assumptions stated.
- How to handle metrics that look healthy but do not match leadership or user pain.
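
Two of the items above, error budgets and ROI, are easier to defend in an interview with numbers attached. The following is a minimal sketch, not a prescribed method: the function names are hypothetical, the ROI definition is the common (benefit - cost) / cost form, and the cost-savings model assumes savings come only from avoided incident minutes, which is a deliberate simplification.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent in the window.

    Returns 1.0 when nothing has failed, 0.0 when the budget is exactly
    spent, and a negative value when the SLO has been breached.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target of 100% (or no traffic) leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


def annual_savings(incidents_avoided_per_year: float,
                   avg_incident_minutes: float,
                   cost_per_minute: float) -> float:
    """Assumed cost-savings model: avoided incident minutes times cost per minute."""
    return incidents_avoided_per_year * avg_incident_minutes * cost_per_minute


def simple_roi(annual_benefit: float, annual_cost: float) -> float:
    """ROI as (benefit - cost) / cost; 1.5 means a 150% return."""
    if annual_cost <= 0:
        raise ValueError("annual_cost must be positive")
    return (annual_benefit - annual_cost) / annual_cost
```

With illustrative inputs (all made up), a 99.9% SLO over 1,000,000 requests allows about 1,000 failures, so 400 failures leaves roughly 60% of the budget. Avoiding 6 incidents a year of 50 minutes each at 1,000 per minute gives `annual_savings(6, 50, 1000)` of 300,000, and against a 120,000 annual cost `simple_roi` returns 1.5. In a real answer, the cost-per-minute figure is the assumption most worth validating with finance or product owners.
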
### Follow-up Questions
- How would you prioritize between capacity work and model-quality work?
- What would you put in the postmortem?
- How would you make teams accountable without creating blame?
- What if the average SLA is fine but enterprise customers are unhappy?
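
The last follow-up, a healthy average hiding an unhappy segment, can be illustrated with a small segmentation check. This is a sketch under stated assumptions: the record shape `(segment, total_requests, failed_requests)`, the function names, and the example numbers are all hypothetical, and availability is used as a stand-in for whichever SLA is in question.

```python
from collections import defaultdict

# Hypothetical traffic for one window: (segment, total_requests, failed_requests).
EXAMPLE_RECORDS = [
    ("consumer", 9_900_000, 4_950),
    ("enterprise", 100_000, 2_000),
]


def overall_availability(records: list) -> float:
    """Availability across all traffic, weighted by request volume."""
    total = sum(r[1] for r in records)
    failed = sum(r[2] for r in records)
    return 1.0 - failed / total


def availability_by_segment(records: list) -> dict:
    """Availability computed separately for each segment."""
    totals = defaultdict(lambda: [0, 0])
    for segment, total, failed in records:
        totals[segment][0] += total
        totals[segment][1] += failed
    return {
        seg: 1.0 - failed / total
        for seg, (total, failed) in totals.items()
        if total > 0
    }


def segments_below_target(records: list, target: float) -> list:
    """Segments whose own availability misses the target, sorted by name."""
    per_segment = availability_by_segment(records)
    return sorted(seg for seg, avail in per_segment.items() if avail < target)
```

In this made-up data the overall availability clears a 99.9% target because consumer traffic dominates the volume, while `segments_below_target(EXAMPLE_RECORDS, 0.999)` still flags the enterprise segment. That is the pattern to look for when headline metrics and leadership sentiment disagree: report the SLA per segment (or per region, model, or tier), not only as a volume-weighted average.
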
Quick Answer: Prepare a TPM answer for diagnosing SLA drops in an ML platform. Covers incident containment, RCA, reliability fixes, ROI, accountability, SLOs, postmortems, and handling misleading headline metrics.