How to triage slow service alerts
Company: ByteDance
Role: Site Reliability Engineer
Category: Software Engineering Fundamentals
Difficulty: hard
Interview Round: Technical Screen
A production alert indicates that a web service is experiencing high latency or slow responses. As an SRE, describe how you would **triage**, **investigate**, and **mitigate** the issue.
Your answer should cover:
- how to confirm the alert is real and assess severity,
- how to identify the blast radius and user impact,
- what Linux-, host-, network-, application-, and dependency-level checks you would run,
- what immediate mitigation steps you would take to reduce impact,
- how you would communicate during the incident and drive the service toward recovery.
Quick Answer: This question evaluates operational troubleshooting and incident-management competencies: system observability, severity assessment, blast-radius and user-impact analysis, dependency and performance diagnostics, mitigation decision-making, and incident communication.
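As one illustration of the first two points (confirming the alert and sizing the blast radius), here is a minimal Python sketch. It assumes you can pull recent request samples as `(region, latency_ms)` pairs from your metrics or log pipeline. The names `triage`, `P99_SLO_MS`, and `SEVERE_FACTOR` and the threshold values are hypothetical, chosen only for the example; a real service would take them from its own SLO.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical thresholds; real values would come from the service's SLO.
P99_SLO_MS = 500.0
SEVERE_FACTOR = 2.0


def p99(values):
    """Return the 99th-percentile latency of a non-empty list."""
    if len(values) < 2:
        return float(values[0])
    # quantiles(n=100) yields 99 cut points; the last one is the p99.
    return quantiles(values, n=100, method="inclusive")[-1]


def triage(samples):
    """Group (region, latency_ms) samples and classify each region.

    Returns (report, affected): report maps region -> (p99_ms, status),
    and affected is the sorted list of regions breaching the threshold.
    """
    by_region = defaultdict(list)
    for region, latency_ms in samples:
        by_region[region].append(latency_ms)

    report = {}
    for region, latencies in by_region.items():
        value = p99(latencies)
        if value > P99_SLO_MS * SEVERE_FACTOR:
            status = "severe"
        elif value > P99_SLO_MS:
            status = "degraded"
        else:
            status = "ok"
        report[region] = (round(value, 1), status)

    affected = sorted(r for r, (_, s) in report.items() if s != "ok")
    return report, affected
```

Calling `triage(samples)` on a recent window shows whether the slowdown is global or confined to a few regions, which feeds directly into the severity call and the first incident update. The same grouping could be done by endpoint, host, or customer tier instead of region.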