Assessing and Improving a Service’s Performance Under High Load
Context
You are responsible for a user-facing, stateless HTTP microservice that handles read-heavy traffic with bursty patterns (e.g., push-notification spikes). The service depends on a cache (e.g., Redis) and a database (e.g., MySQL/Postgres), is deployed behind a load balancer, and auto-scales.
Describe how you would assess and improve this service’s performance under high load. Address the following:
- SLIs/SLOs: Define the key Service Level Indicators (SLIs) and Service Level Objectives (SLOs), including latency percentiles, throughput, error rate, availability, and saturation.
- Instrumentation and Profiling: Describe how you would instrument the service (metrics, traces, logs) and how you would profile it (CPU, memory, locks, I/O) to find hotspots.
- Load-Testing Plan: Outline a realistic load-testing plan: workload modeling, test types (ramp, spike, soak, stress), environment setup, data representativeness, and pass/fail criteria.
- Likely Bottlenecks: Identify common bottlenecks under high load (CPU, I/O, database, cache, network, GC, locks, pools) and symptoms you’d look for.
- Concrete Optimizations: Propose specific optimizations and discuss trade-offs (e.g., caching strategies, query/index tuning, batching, backpressure, retries, timeouts, concurrency control, network/protocol tuning, autoscaling).
Make any minimal assumptions you need and be explicit about them.
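As an illustration of how the latency-percentile and error-rate SLIs can double as load-test pass/fail criteria, here is a minimal Python sketch. It is not part of the question itself. The `Slo` thresholds (p50 of 50 ms, p99 of 300 ms, 0.1% errors), the `(latency_ms, http_status)` sample format, and the choice to count only 5xx responses as errors are all assumptions made for the example. The percentile uses the simple nearest-rank method; a production system would more likely read percentiles from histogram metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical targets; real values come from product requirements."""
    p50_ms: float = 50.0
    p99_ms: float = 300.0
    max_error_rate: float = 0.001  # 0.1% of requests


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer in 1..100."""
    if not sorted_values:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be in 1..100")
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = (pct * len(sorted_values) + 99) // 100
    return sorted_values[rank - 1]


def evaluate(samples, slo):
    """Compute SLIs from (latency_ms, http_status) pairs and compare to the SLO.

    Assumption: only 5xx responses count as errors.
    """
    if not samples:
        raise ValueError("no samples")
    latencies = sorted(latency for latency, _ in samples)
    errors = sum(1 for _, status in samples if status >= 500)
    slis = {
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(samples),
    }
    passed = (
        slis["p50_ms"] <= slo.p50_ms
        and slis["p99_ms"] <= slo.p99_ms
        and slis["error_rate"] <= slo.max_error_rate
    )
    return slis, passed
```

Calling `evaluate(samples, Slo())` on the samples collected during a ramp, spike, or soak run returns the measured SLIs and a single boolean verdict, which makes the pass/fail criteria explicit and repeatable across runs.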