SLI vs SLO vs SLA for a Web API; Error Budgets; Monitoring and Alerting Design
Context: You are designing reliability goals and on-call policies for a production web API that serves JSON over HTTPS. Traffic is a mix of GET and POST endpoints. You need to define what you measure (SLIs), set targets (SLOs), state the contractual promise (SLA), plan an error budget against a quarterly SLO, and design monitoring and alerting that minimizes alert fatigue.
Tasks
- Define and contrast SLI, SLO, and SLA. Give concrete SLI examples for:
  - Availability (success rate)
  - Latency (e.g., request duration under a threshold)
- Given a quarterly target SLO, define a reasonable error budget, and show how you would apportion and track its consumption over time.
- Design a monitoring and alerting system that minimizes alert fatigue:
  - Choose which signals to alert on and why
  - Set alert thresholds relative to SLOs
  - Aggregate and deduplicate alerts
  - Apply multi-window/multi-burn-rate policies
  - Define escalation, silencing, and runbook practices
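
As a starting point for the error-budget and burn-rate tasks, the arithmetic can be sketched as follows. This is a minimal sketch under stated assumptions, not a prescribed solution: the 90-day quarter, the 99.9% SLO used in the usage note, and all function names are illustrative and do not come from the task description.

```python
# Assumption: a quarter is treated as 90 days; adjust to the real calendar.
QUARTER_HOURS = 90 * 24


def error_budget_minutes(slo: float, period_hours: float = QUARTER_HOURS) -> float:
    """Minutes of total unavailability allowed in the period for a time-based SLO."""
    return (1.0 - slo) * period_hours * 60


def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - SLO).

    A value of 1.0 means the budget would be spent exactly at period end.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (bad / total) / (1.0 - slo)


def burn_threshold(budget_fraction: float, window_hours: float,
                   period_hours: float = QUARTER_HOURS) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within `window_hours`."""
    return budget_fraction * period_hours / window_hours


def should_page(long_burn: float, short_burn: float, threshold: float) -> bool:
    """Multi-window check: page only if both windows exceed the threshold.

    The short window confirms the problem is still ongoing, which lets the
    alert clear soon after recovery instead of lingering.
    """
    return long_burn >= threshold and short_burn >= threshold
```

For example, under the assumed 99.9% quarterly SLO, `error_budget_minutes(0.999)` comes to about 129.6 minutes. A hypothetical policy of "page when 2% of the budget is consumed within one hour" corresponds to `burn_threshold(0.02, 1)`, a burn rate of about 43.2 for a 90-day period; the 2% figure is an illustrative choice, and slower burns would typically route to a ticket rather than a page.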