How would you troubleshoot Linux services?
Company: ByteDance
Role: Site Reliability Engineer
Category: Software Engineering Fundamentals
Difficulty: medium
Interview Round: Technical Screen
You are acting as an SRE responsible for Linux-based services in production. Describe how you would handle the following situations:
1. A server reports that the disk is full. Give a step-by-step troubleshooting process, including what commands you would run, what you would check next after each command, what it means if the filesystem is full but you cannot find large files, and how you would identify the process responsible for the disk usage.
2. A backend service is responding slowly. Explain a structured troubleshooting approach, including the monitoring metrics you would inspect, how you would isolate the bottleneck, and common causes of latency at the application, host, network, and dependency layers.
3. Describe several ways to host an application on Linux, such as native installation, containers, and virtual machines. For each option, explain concretely how you would deploy and run the application, as well as the trade-offs.
Quick Answer: This question evaluates Linux system administration and site reliability engineering skills: troubleshooting a full disk (including finding the process responsible for the usage), diagnosing backend latency layer by layer, and comparing deployment approaches such as native installs, containers, and virtual machines.
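For the first scenario, a common explanation for "the filesystem is full but no large files can be found" is that a process still holds an open descriptor to a file that has already been deleted, so the space is not released until the process closes the file or exits. Another is inode exhaustion, where the byte count looks fine but no new files can be created. The sketch below is one possible way to check both from Python on Linux; it is an illustration, not a replacement for tools such as `df`, `du`, and `lsof`. The function names `filesystem_usage` and `find_deleted_open_files` are hypothetical, and the scan relies on the Linux `/proc/<pid>/fd` layout, where the link target of a deleted file ends in " (deleted)".

```python
import os
import shutil

# Linux appends this suffix to the link target of an open file that was unlinked.
DELETED_SUFFIX = " (deleted)"


def filesystem_usage(path="/"):
    """Return (percent of bytes used, percent of inodes used or None).

    The byte figure is approximate and may differ slightly from `df`,
    which accounts for blocks reserved for root.
    """
    space = shutil.disk_usage(path)
    bytes_pct = 100.0 * space.used / space.total if space.total else 0.0

    st = os.statvfs(path)
    inode_pct = None
    if st.f_files:  # some filesystems report 0 total inodes
        inode_pct = 100.0 * (st.f_files - st.f_ffree) / st.f_files
    return bytes_pct, inode_pct


def find_deleted_open_files(proc_root="/proc"):
    """List (size_bytes, pid, path) for deleted files that are still open.

    Results are sorted largest first. Processes we cannot inspect
    (permission denied, or exited mid-scan) are skipped silently,
    so run as root for a complete picture.
    """
    results = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return results  # no /proc here (for example, not Linux)

    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue
        for fd in fds:
            link = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if not target.endswith(DELETED_SUFFIX):
                continue
            try:
                # stat() follows the fd link to the still-open file.
                size = os.stat(link).st_size
            except OSError:
                size = 0
            path = target[: -len(DELETED_SUFFIX)]
            results.append((size, int(entry), path))

    results.sort(reverse=True)
    return results


if __name__ == "__main__":
    bytes_pct, inode_pct = filesystem_usage("/")
    print(f"bytes used: {bytes_pct:.1f}%  inodes used: {inode_pct}")
    for size, pid, path in find_deleted_open_files()[:10]:
        print(f"{size:>14}  pid={pid}  {path}")
```

If the scan reports a large deleted log file held by a long-running process, the usual remedies are to restart or signal that process so it reopens its logs, or to truncate the file through its `/proc/<pid>/fd/<n>` entry. Note that the list can also include anonymous in-memory files, which do not consume space on the full disk.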