Operations Experience (Behavioral & Leadership — Technical Screen)
Provide a concise but concrete overview of your production operations experience as a software engineer. Address the following:
- On-call participation
  - Rotation model (size, hours, primary/secondary), responsibilities, and page hygiene.
- Incident response and management practices
  - Severity definitions, roles (e.g., incident commander), communication, escalation, and tooling.
- SLO/SLI definition and tracking
  - Key SLIs, SLO targets, error budget policy, alerting strategy, and reporting cadence.
- Runbook creation and maintenance
  - Structure, content, ownership, testing, and where you store them.
- Change management
  - Release process, approvals, canary/progressive delivery, rollback, freezes, and risk controls.
- Tooling used
  - Observability, alerting/on-call, incident management, CI/CD, change control, and documentation.
- A high-severity incident example
  - Your role and decisions, timeline, key actions, resolution, postmortem outcomes, and measurable impact.
Use specific metrics (e.g., MTTA/MTTR, % error rate, latency percentiles, availability) and concrete examples. A STAR structure (Situation, Task, Action, Result) is encouraged.
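
To illustrate the kind of metrics requested, here is a minimal sketch of how MTTA, MTTR, and error budget consumption might be derived from incident timestamps. The incident data, the 99.9% target, and the 30-day window are hypothetical. Definitions vary between teams: this sketch assumes MTTA runs from page to acknowledgement, MTTR from page to resolution, and that the full page-to-resolution interval counts as downtime.

```python
from datetime import datetime, timedelta


def error_budget_minutes(slo_target: float, window: timedelta) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo_target) * window.total_seconds() / 60.0


def mean_minutes(deltas: list[timedelta]) -> float:
    """Arithmetic mean of a list of durations, in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60.0


# Hypothetical incidents: (paged, acknowledged, resolved)
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4), datetime(2024, 1, 5, 10, 46)),
    (datetime(2024, 1, 19, 2, 30), datetime(2024, 1, 19, 2, 36), datetime(2024, 1, 19, 3, 4)),
]

# Assumption: MTTA = page -> acknowledgement, MTTR = page -> resolution.
mtta = mean_minutes([ack - paged for paged, ack, _ in incidents])
mttr = mean_minutes([resolved - paged for paged, _, resolved in incidents])

# Assumption: a 99.9% availability SLO measured over a 30-day window.
budget = error_budget_minutes(0.999, timedelta(days=30))

# Assumption: the whole page-to-resolution interval counts as downtime.
downtime = sum((resolved - paged).total_seconds() for paged, _, resolved in incidents) / 60.0
budget_used = downtime / budget

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
print(f"Error budget: {budget:.1f} min, consumed: {budget_used:.0%}")
```

In an answer, figures like these are most useful when paired with the definition used, since teams differ on whether the clock starts at fault onset, detection, or page.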