What operations experience do you have? Describe on-call participation, incident response and management practices, SLO/SLI definition and tracking, runbook creation, change management, and the tooling you used. Provide concrete examples of a high-severity incident you handled, your role and decisions, postmortem outcomes, and measurable impact.
Quick Answer: This question, from the Behavioral & Leadership category of software engineering interviews, evaluates a candidate's production operations competency: on-call practices, incident response and management, SLO/SLI definition and tracking, runbook creation and maintenance, change management, observability and CI/CD tooling, and post-incident analysis. It is commonly asked to verify real-world operational judgment and measurable impact (MTTA/MTTR, error rates, latency percentiles, availability), and it primarily tests practical application and leadership in site reliability and incident management rather than purely conceptual understanding.
Solution
# How to Answer: Structure, Examples, and Best Practices
Below is a step-by-step guide and a sample answer you can adapt. Aim for a 60–90 second overview, then be ready to dive deep on any area.
## 1) 60-second overview template
- Scope: “I’ve spent X years owning services in production, participating in a Y-person on-call rotation.”
- Incident practice: “We use SEV levels, an incident commander model, and blameless postmortems.”
- Reliability: “I defined SLIs/SLOs for A and B; we track error budgets and burn-rate alerts.”
- Operations assets: “I maintain runbooks/playbooks and automated remediation for common faults.”
- Change management: “We ship via canaries/feature flags with auto-rollback guards.”
- Tooling: “PagerDuty, Datadog/Prometheus/Grafana, Splunk/ELK, GitHub Actions/Argo, incident.io/Statuspage.”
- Impact: “Reduced MTTR by Z%, cut alert noise by W%, maintained 99.9x% availability.”
## 2) Detailed talking points and examples
### On-call participation
- Rotation: 6–10 engineers, weekly primary/secondary coverage, handoffs with context notes and current hot issues.
- Hygiene: Weekly alert review; consolidate duplicate alerts; convert recurring alerts into engineering work.
- Metrics: MTTA (mean time to acknowledge) target < 5 minutes; page volume ≤ 1 actionable/night on average.
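If the interviewer probes how those numbers are produced, a minimal sketch of computing them from exported paging data could look like the following (the record fields and rotation length are illustrative assumptions, not any vendor's API):

```python
from datetime import datetime
from statistics import mean

# Illustrative page records: when the page fired, when it was acknowledged,
# and whether it needed human action (actionable) or was noise.
pages = [
    {"fired": datetime(2024, 3, 1, 2, 10), "acked": datetime(2024, 3, 1, 2, 13), "actionable": True},
    {"fired": datetime(2024, 3, 2, 23, 40), "acked": datetime(2024, 3, 2, 23, 44), "actionable": False},
]

# MTTA: mean minutes from page firing to acknowledgement.
mtta_minutes = mean((p["acked"] - p["fired"]).total_seconds() / 60 for p in pages)

# Actionable pages per night across a one-week rotation.
rotation_nights = 7
actionable_per_night = sum(p["actionable"] for p in pages) / rotation_nights

print(f"MTTA: {mtta_minutes:.1f} min; actionable pages/night: {actionable_per_night:.2f}")
```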
### Incident response & management
- Severity: SEV-1 (critical user/business impact), SEV-2 (degraded), SEV-3 (localized/limited).
- Roles: Incident Commander (IC), Communications Lead, Ops/SME Leads. IC coordinates; others execute and update stakeholders.
- Flow: Auto-page → IC appoints roles → war room (Slack/Zoom) → status updates on a set cadence (e.g., every 15 min) → mitigation first, then diagnosis → postmortem within 3–5 business days.
- Tools: PagerDuty/Opsgenie, Slack with incident bot, Zoom/Meet, status page, incident.io/FireHydrant for timelines and follow-ups.
### SLO/SLI definition and tracking
- SLIs (examples):
- Availability: request success rate (non-5xx responses) from user edge.
- Latency: p95 or p99 below threshold for key endpoints.
- Correctness: ratio of validated successful outcomes (e.g., payment succeeds end-to-end).
- SLOs (examples):
- 99.9% monthly availability (error budget ≈ 43.2 minutes/month).
- p95 latency < 300 ms for read endpoints, < 800 ms for write endpoints, 99% of the time.
- Error budget policy: If burn rate is high, slow or freeze changes; prioritize reliability work.
- Alerting: Multi-window, multi-burn-rate alerts (e.g., fast-burn 1-hour window and slow-burn 6–24-hour window) to reduce noise and catch both spikes and smoldering issues.
- Reporting: Weekly SLO review; monthly error budget report and top reliability risks.
Formulas (plain):
- Availability = 1 - (minutes of user-impacting downtime / total minutes)
- Error budget used (%) = (actual failed requests or downtime / failures or downtime allowed by the SLO) * 100
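A minimal sketch of these formulas and the multi-window burn-rate alerting described above, expressed over request counts (the 14.4x/6x thresholds follow a commonly cited pattern but are assumptions to tune against your own SLO, not a standard):

```python
# Error budget and burn-rate math for a request-based availability SLO.
SLO = 0.999          # 99.9% monthly availability target
BUDGET = 1 - SLO     # 0.1% of requests (or minutes) may be bad
# For a 30-day month: 30 * 24 * 60 * 0.001 ≈ 43.2 minutes of allowed downtime.

def availability(bad: int, total: int) -> float:
    """Success ratio over the window: 1 - bad/total."""
    return 1 - bad / total

def budget_consumed_pct(bad: int, total: int) -> float:
    """Percent of the error budget used: actual failures vs. allowed failures."""
    return 100 * bad / (total * BUDGET)

def burn_rate(bad: int, total: int) -> float:
    """How fast the budget burns: 1.0 = exactly on budget, >1 = too fast."""
    return (bad / total) / BUDGET

def should_page(fast: tuple, slow: tuple) -> bool:
    """Multi-window, multi-burn-rate alert: page only when both a short window
    (fast burn, e.g. 1 hour) and a longer window (slow burn, e.g. 6 hours)
    exceed their thresholds, which cuts noise from brief blips."""
    return burn_rate(*fast) > 14.4 and burn_rate(*slow) > 6.0

# Example: 3,000 of 100,000 requests failed in the last hour (burn 30x) and
# 60,000 of 6,000,000 failed over 6 hours (burn 10x) -> page.
print(should_page((3_000, 100_000), (60_000, 6_000_000)))  # True
```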
### Runbooks
- Structure:
- Context: service, owners, dependencies, dashboards, logs, run commands.
- Symptoms & diagnosis: what to check first (graphs, logs, feature flags, recent deploys).
- Remediation steps: safe, ordered, reversible actions with expected outcomes.
- Rollback/disable steps: scripts/commands and validation checks.
- Escalations: who and when; vendor contacts.
- Quality: Version-controlled (Markdown), linked from alerts, tested via game days, includes time-boxes (e.g., spend ≤ 10 minutes on path A before escalating).
### Change management
- Process: PR with 2 reviewers; CI tests; deploy to staging → canary (1–5%) → progressive rollout (25/50/100) with automated health checks.
- Risk controls: Feature flags for risky paths, automatic rollback on SLO breach (sketched after this list), deployment freezes during major events, RFCs for high-risk changes.
- Metrics: Change fail rate, mean time to restore (MTTR), deployment frequency, rollback rate.
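A minimal sketch of the progressive rollout with automated health checks and auto-rollback described above, assuming hypothetical set_traffic, rollback, and check_slis hooks (stage percentages and thresholds are illustrative, not any deploy tool's defaults):

```python
import time

# Progressive delivery loop: shift traffic in stages and roll back
# automatically if SLIs breach objective thresholds at any stage.
STAGES = [1, 5, 25, 50, 100]     # percent of traffic on the new version
MAX_5XX_RATE = 0.005             # 0.5% error-rate ceiling for the canary slice
MAX_P95_MS = 300                 # p95 latency ceiling for key read endpoints
SOAK_SECONDS = 600               # observe each stage before promoting

def check_slis() -> dict:
    """Placeholder: fetch the canary's current 5xx rate and p95 latency
    from the observability backend (Prometheus, Datadog, etc.)."""
    raise NotImplementedError

def rollout(set_traffic, rollback) -> bool:
    """Return True on full rollout, False if the guardrails triggered a rollback."""
    for percent in STAGES:
        set_traffic(percent)     # e.g., update weights in the router or mesh
        time.sleep(SOAK_SECONDS) # let metrics accumulate for this stage
        slis = check_slis()
        if slis["error_rate"] > MAX_5XX_RATE or slis["p95_ms"] > MAX_P95_MS:
            rollback()           # objective criteria, no judgment call mid-rollout
            return False
    return True
```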
### Tooling (examples)
- Observability: Datadog or Prometheus + Grafana; OpenTelemetry for traces.
- Logging: Splunk or ELK; distributed tracing via Datadog APM/Jaeger.
- On-call/alerting: PagerDuty/Opsgenie.
- CI/CD: GitHub Actions, Jenkins, Argo Rollouts/Argo CD/Spinnaker.
- Config/infra: Terraform, Helm, Kubernetes.
- Incident management: incident.io/FireHydrant; Statuspage/Uptime checks.
- Docs/runbooks: Markdown in repo, Confluence/Wiki, links embedded in alerts.
## 3) Sample high-severity incident (STAR)
- Situation: Peak traffic on a Friday, our Checkout API started returning elevated 5xx errors and p95 latency spiked from 250 ms to 4 s. Error budget for the month was at 60%; revenue at risk was high.
- Task: As on-call primary, I took the Incident Commander role within 3 minutes. Goals: stop the bleeding, restore service, and minimize financial/user impact.
- Actions:
- Declared SEV-1, opened a Slack incident channel, assigned Comms Lead and SME for payments.
- Froze deployments and put non-critical features (recommendations, not checkout) into degraded mode to protect the core flow.
- Used dashboards to correlate a spike in cache misses with a rollout 20 minutes prior. Suspected cache thrash causing DB saturation.
- Executed runbook steps: toggled a feature flag to revert to the prior pricing cache strategy; scaled read replicas; set connection pool caps and enabled rate limiting to protect the primary DB.
- Verified improvement via SLIs (5xx rate and p95 latency). Rolled back the canary to the previous version.
- Scheduled a postmortem and captured a clean incident timeline using incident tooling.
- Results:
- Time to acknowledge (MTTA): 2 minutes; time to mitigate: 12 minutes; full recovery: 31 minutes.
- Peak 5xx: 3.1% of requests for ~18 minutes; estimated revenue impact reduced by ~40% due to rapid mitigation.
- Root cause: config change reduced cache TTL from 15 minutes to 15 seconds during a rollout, causing cache evictions and DB overload.
- Postmortem outcomes:
- Added pre-deploy config validation and canary guardrails for cache config.
- Implemented circuit breakers and backpressure to protect DB.
- Introduced burn-rate alerts tied to availability and latency SLIs.
- Authored a detailed cache-thrashing runbook and built a synthetic check for cache churn (sketched below).
- Measurable impact over the next quarter: MTTR reduced from 42 → 14 minutes (−67%), paging volume −35%, availability improved from 99.88% → 99.95%, and zero SEV-1 incidents for two quarters.
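As one concrete follow-up, here is a minimal sketch of the synthetic cache-churn check mentioned in the postmortem outcomes (the get_cache_stats source and both thresholds are illustrative assumptions):

```python
# Synthetic check: flag the cache-thrash pattern (falling hit ratio, spiking
# evictions) that preceded the DB saturation in the incident above.
MIN_HIT_RATIO = 0.80            # alert if fewer than 80% of lookups hit the cache
MAX_EVICTIONS_PER_MIN = 5_000   # alert on abnormal eviction churn

def get_cache_stats() -> dict:
    """Placeholder: pull hits, misses, and evictions/minute from the cache
    layer (e.g., Redis INFO counters) via the metrics pipeline."""
    raise NotImplementedError

def cache_churn_problems() -> list:
    stats = get_cache_stats()
    hit_ratio = stats["hits"] / max(stats["hits"] + stats["misses"], 1)
    problems = []
    if hit_ratio < MIN_HIT_RATIO:
        problems.append(f"hit ratio {hit_ratio:.1%} below {MIN_HIT_RATIO:.0%}")
    if stats["evictions_per_min"] > MAX_EVICTIONS_PER_MIN:
        problems.append(f"{stats['evictions_per_min']} evictions/min above limit")
    return problems             # non-empty -> page or open a ticket
```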
## 4) Common pitfalls and guardrails
- SLIs measured at the wrong point (e.g., upstream vs user edge) can hide problems. Measure user-perceived outcomes.
- Overly tight SLOs cause alert fatigue; set targets that reflect desired reliability and realistic budgets.
- Alerting on every internal metric creates noise; prefer SLO burn alerts and a few high-signal symptoms.
- Runbooks without time boxes cause thrash; define escalation triggers and limits.
- Canary without automated health checks is just staging in production; enforce objective rollback criteria.
- Practice: Run game days, chaos tests, and DR failovers to validate runbooks and paging paths.
## 5) Quick template you can adapt
- On-call: “8-engineer rotation, weekly primary/secondary, PagerDuty. Reduced pages per week from ~14 to ~8 by consolidating alerts and fixing the top 3 recurring issues.”
- Incident management: “SEV-1/2/3 model with IC/Comms/SME roles; 15-min update cadence; blameless postmortems within 5 days.”
- SLO/SLI: “99.9% availability and p95 latency SLOs; track error budgets and multi-window burn alerts; weekly SLO review.”
- Runbooks: “Markdown in repo, linked in alerts; tested quarterly via game days; include rollback and safety checks.”
- Change management: “Trunk-based, 2 approvals, canary + progressive delivery, feature flags, auto-rollback on SLO breach.”
- Tooling: “Datadog, Grafana/Prometheus, Splunk, PagerDuty, GitHub Actions, Argo Rollouts, incident.io.”
- Impact: “MTTR −60%, alert noise −30%, 99.95% availability over last 6 months.”
Use the STAR method for your incident example, quantify outcomes, and be ready to deep-dive into any decision, metric, or tool you mention.