Scenario
You are on-call for a high-throughput network service (e.g., a TCP/HTTP server). Under load, users report:
- Throughput suddenly drops
- p99 latency increases significantly
- CPU may be high or low depending on the incident
The server uses a concurrency model based on an event loop and OS I/O multiplexing (e.g., epoll), and may use a thread pool for some work.
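For orientation, the concurrency model described above can be sketched in a few lines. This is a minimal illustration, not the service's actual code: the function name run_echo_once is hypothetical, a socketpair stands in for a real client connection, and the payload is assumed small enough to fit in the socket buffer so a single send suffices. Python's selectors.DefaultSelector picks epoll on Linux and is level-triggered.

```python
import selectors
import socket


def run_echo_once(payload: bytes) -> bytes:
    """Echo a small payload through a readiness-driven loop and return the reply."""
    sel = selectors.DefaultSelector()   # epoll on Linux, kqueue on BSD/macOS
    client, server = socket.socketpair()
    server.setblocking(False)           # the loop must never block on one socket
    client.settimeout(2.0)              # the test client may block, with a bound
    sel.register(server, selectors.EVENT_READ)
    try:
        client.sendall(payload)
        echoed = 0
        while echoed < len(payload):
            events = sel.select(timeout=1.0)
            if not events:
                raise TimeoutError("no readiness event within 1s")
            for key, _mask in events:
                try:
                    data = key.fileobj.recv(4096)
                except BlockingIOError:
                    continue            # readiness was stale; wait for the next event
                if not data:
                    raise ConnectionError("peer closed early")
                # Assumption: small payload, so send() accepts all of it.
                # A real server would queue the remainder and wait for EVENT_WRITE.
                key.fileobj.send(data)
                echoed += len(data)
        received = b""
        while len(received) < len(payload):
            chunk = client.recv(4096)
            if not chunk:
                break
            received += chunk
        return received
    finally:
        sel.close()
        client.close()
        server.close()
```

Every handler in such a loop runs on the same thread, so anything that blocks inside the for loop stalls all other connections. That is the property most of the questions below probe.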
Task
Walk through how you would debug and improve performance. Your interviewer may ask about:
- Connection handling and backlog
- Event loop behavior
- Blocking vs non-blocking I/O
- I/O multiplexing (select/poll/epoll, edge-triggered vs level-triggered)
- Where io_uring might help vs where it would not
- Typical bottlenecks: syscalls, context switching, lock contention, head-of-line blocking, buffer management
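Head-of-line blocking is the easiest of these bottlenecks to reason about with numbers. The sketch below is a toy model, not a measurement tool: latencies_ms and its parameters are hypothetical names, all requests are assumed to arrive at time zero, and an offloaded request is assumed to find an idle pool thread immediately.

```python
def latencies_ms(handler_ms, offload_over_ms=None, handoff_ms=1):
    """Completion time of each request on a single event-loop thread.

    handler_ms:      per-request handler cost, in arrival order
    offload_over_ms: handlers costlier than this go to a thread pool (None = never)
    handoff_ms:      loop-thread cost of handing one request to the pool
    """
    clock = 0            # time consumed on the loop thread so far
    finished = []
    for cost in handler_ms:
        if offload_over_ms is not None and cost > offload_over_ms:
            clock += handoff_ms            # the loop pays only for the handoff
            finished.append(clock + cost)  # the work itself runs on a pool thread
        else:
            clock += cost                  # inline handler delays everyone behind it
            finished.append(clock)
    return finished


inline = latencies_ms([1, 1, 200, 1, 1])
# → [1, 2, 202, 203, 204]
offloaded = latencies_ms([1, 1, 200, 1, 1], offload_over_ms=50)
# → [1, 2, 203, 4, 5]
```

One 200 ms handler run inline pushes every later request past 200 ms, which is the p99 signature in the scenario. Offloading makes the slow request marginally slower and everything behind it fast again, at the cost of handoff overhead and possible lock contention in the pool.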
Output expected
Explain a structured approach:
- What you would measure/collect first
- How you would narrow down the bottleneck
- Concrete fixes and trade-offs
- How you would validate improvements and avoid regressions
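For the validation step, one possible approach is to compare tail latency between a baseline run and a candidate run of the same load test. The sketch below assumes latency samples are already collected as lists of numbers; percentile and regressed are hypothetical helper names, the nearest-rank method is one of several percentile definitions, and the 10% tolerance is an arbitrary placeholder to tune against run-to-run noise.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def regressed(baseline, candidate, pct=99, tolerance=0.10):
    """True if the candidate's percentile exceeds the baseline's by more than tolerance."""
    return percentile(candidate, pct) > percentile(baseline, pct) * (1 + tolerance)


baseline = list(range(1, 101))           # 1..100 ms, one sample each
candidate = baseline[:98] + [500, 500]   # same body, two slow outliers

print(percentile(baseline, 50))          # 50
print(percentile(baseline, 99))          # 99
print(regressed(baseline, candidate))    # True
```

The example shows why the mean is a poor gate: two outliers in a hundred barely move the average but multiply p99 by about five. Running a check like this in CI against a fixed load profile is one way to catch regressions before they reach production.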