Describe a service performance challenge
Company: TikTok
Role: Software Engineer
Category: Behavioral & Leadership
Difficulty: medium
Interview Round: Technical Screen
Describe a challenging time you had to improve a production service’s performance. What was the context and target metric (e.g., p99 latency, throughput, error rate)? How did you diagnose the bottleneck, what experiments or measurements did you run, what changes did you implement, and what was the measurable before/after impact? What trade-offs and risks did you manage, and what did you learn?
Quick Answer: This question evaluates competency in production performance engineering: observability-driven diagnosis, disciplined experimentation and impact quantification, and the judgment to prioritize trade-offs and mitigate risk.
Solution
# A strong framework and an example answer
Use STAR with metrics: Situation, Task, Actions, Results, plus Trade-offs/Learnings. Make the metric and the bottleneck crystal clear, and quantify impact.
## Example answer (concise but concrete)
Situation
- I owned the read path for a personalized feed API handling ~5k RPS. We were breaching our SLO (p99 <= 500 ms) during peak; p99 latency had regressed to 1.8 s and the error rate had spiked to 1.2% under load.
Task
- Reduce p99 latency below 500 ms without increasing error rate, ideally raising sustainable throughput for an upcoming traffic event.
Actions
1) Instrument and localize the bottleneck
- Added RED/USE dashboards (Rate, Errors, Duration; Utilization, Saturation, Errors) and raised the tracing sample rate to 20% with OpenTelemetry (see the sketch after this list). Broke latency down by hop: API -> feature service -> Postgres -> external profile service.
- Load-tested with k6 to reproduce peak (up to 8k RPS), plotting latency vs. concurrency to spot queueing. Applied Little’s Law (L = λW) to estimate in-flight load: at 6k RPS and p99 = 1.8 s, ≈10.8k requests in flight at the tail.
- Profiles (pprof) showed high allocations in request fan-out and TLS handshake overhead from frequent new connections.
- Traces revealed sequential downstream calls (N=4) consuming ~900 ms, a hot Postgres query with a missing composite index, and occasional retries amplifying tail latency.
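For illustration, here is a minimal Go sketch of the per-hop tracing setup, assuming the OpenTelemetry Go SDK (`go.opentelemetry.io/otel`); the tracer name and the `fetchFeatures`/`queryPostgres` helpers are hypothetical stand-ins for the real hops, and exporter wiring is omitted:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracing samples 20% of traces at the root; child spans follow the
// parent's decision. Exporter configuration is omitted in this sketch.
func initTracing() *sdktrace.TracerProvider {
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.20))),
	)
	otel.SetTracerProvider(tp)
	return tp
}

// handleFeed opens one span per request; each downstream helper would open
// its own child span, so the trace shows where time is spent per hop.
func handleFeed(ctx context.Context, userID string) error {
	ctx, span := otel.Tracer("feed-api").Start(ctx, "feed.read")
	defer span.End()

	if err := fetchFeatures(ctx, userID); err != nil { // feature-service hop
		return err
	}
	return queryPostgres(ctx, userID) // database hop
}

// Stubs standing in for the real downstream calls.
func fetchFeatures(ctx context.Context, userID string) error { return nil }
func queryPostgres(ctx context.Context, userID string) error { return nil }

func main() {
	tp := initTracing()
	defer func() { _ = tp.Shutdown(context.Background()) }()
	_ = handleFeed(context.Background(), "user-123")
}
```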
2) Validate hypotheses with micro-experiments
- Enabled HTTP/2/gRPC keep-alives and a 1k-connection pool in a canary: per-call handshake overhead dropped from ~40 ms to <5 ms.
- EXPLAIN ANALYZE on the top query showed a sequential scan (~1.2 s worst-case). Adding an index on (user_id, recency) in staging brought it to ~70 ms.
- Converted the sequential fan-out to parallel futures with a 300 ms per-call deadline and a 350 ms overall budget (see the sketch after this list); returning partial results on timeout noticeably helped p99.
- Prototyped Redis caching for computed features with TTL = 5 min; measured a 70–80% hit rate in staging with synthetic traffic.
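A condensed sketch of that fan-out pattern in Go, standard library only; the dependency functions are placeholders for the real gRPC calls, and the 300/350 ms budgets mirror the numbers above:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type result struct {
	val string
	err error
}

// fanOut issues all dependency calls in parallel under a 350 ms overall
// budget and a 300 ms per-call deadline, returning whatever completed in time.
func fanOut(ctx context.Context, deps []func(context.Context) (string, error)) []string {
	ctx, cancel := context.WithTimeout(ctx, 350*time.Millisecond)
	defer cancel()

	ch := make(chan result, len(deps)) // buffered so late goroutines never block
	for _, dep := range deps {
		go func(dep func(context.Context) (string, error)) {
			callCtx, callCancel := context.WithTimeout(ctx, 300*time.Millisecond)
			defer callCancel()
			v, err := dep(callCtx)
			ch <- result{v, err}
		}(dep)
	}

	// Collect partial results: slow or failed calls are simply skipped.
	out := make([]string, 0, len(deps))
	for range deps {
		select {
		case r := <-ch:
			if r.err == nil {
				out = append(out, r.val)
			}
		case <-ctx.Done():
			return out // overall budget exhausted; serve what we have
		}
	}
	return out
}

func main() {
	deps := []func(context.Context) (string, error){
		func(context.Context) (string, error) { return "social-signals", nil },
		func(context.Context) (string, error) { return "user-profile", nil },
	}
	fmt.Println(fanOut(context.Background(), deps)) // order is nondeterministic
}
```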
3) Implement changes safely
- Database: Added the composite index and rewrote the WHERE clause to use it; deployed off-peak. Verified plan stability.
- Caching: Introduced Redis with a cache-aside pattern and a 5 min TTL plus jitter (sketched after this list). Set up hit-rate, miss-rate, and stale-read dashboards.
- Concurrency: Parallelized downstream calls with budgeted deadlines; added circuit breakers and bulkheads to isolate failures.
- Networking: Enabled connection pooling and gRPC keep-alive; tuned retry policy to bounded exponential backoff with jitter and max 1 retry.
- GC/allocs (Go): Reduced allocations with object pooling; tuned GOGC from 100 to 150 after verifying headroom.
- Rollout: Feature flags + 5% canary -> 25% -> 100%; error budget guardrails and automated rollback.
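A sketch of the cache-aside read path with jittered TTL; the `Cache` interface, key format, and 30 s jitter window are illustrative assumptions (production used Redis):

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// Cache is a stand-in for the Redis client used in production.
type Cache interface {
	Get(ctx context.Context, key string) ([]byte, bool)
	Set(ctx context.Context, key string, val []byte, ttl time.Duration)
}

const baseTTL = 5 * time.Minute

// jitteredTTL spreads expirations over [5m, 5m30s) so keys written in a
// burst do not all expire (and refill) at the same instant.
func jitteredTTL() time.Duration {
	return baseTTL + time.Duration(rand.Int63n(int64(30*time.Second)))
}

// getFeatures is cache-aside: hit -> return; miss -> compute, store, return.
func getFeatures(ctx context.Context, c Cache, userID string,
	compute func(context.Context, string) ([]byte, error)) ([]byte, error) {
	key := "features:" + userID
	if v, ok := c.Get(ctx, key); ok {
		return v, nil // hit: skip the expensive feature computation
	}
	v, err := compute(ctx, userID) // miss: fall through to the source of truth
	if err != nil {
		return nil, err
	}
	c.Set(ctx, key, v, jitteredTTL())
	return v, nil
}

// mapCache is an in-memory stand-in for demos; it ignores TTL.
type mapCache struct{ m map[string][]byte }

func (c *mapCache) Get(_ context.Context, k string) ([]byte, bool) { v, ok := c.m[k]; return v, ok }
func (c *mapCache) Set(_ context.Context, k string, v []byte, _ time.Duration) { c.m[k] = v }

func main() {
	c := &mapCache{m: map[string][]byte{}}
	compute := func(_ context.Context, id string) ([]byte, error) { return []byte("features-for-" + id), nil }
	v, _ := getFeatures(context.Background(), c, "u1", compute) // miss, then cached
	fmt.Println(string(v))
}
```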
Results (Before → After)
- p99 latency: 1.8 s → 420 ms (−77%).
- p95 latency: 750 ms → 210 ms.
- Error rate: 1.2% → 0.2% under peak.
- Throughput (sustainable without SLO breach): 3k RPS → 9k RPS.
- DB QPS: −60% due to caching; Redis hit rate stabilized at ~72%.
- Cost trade-off: +8% memory/infra for Redis; acceptable for SLO gains.
Trade-offs and risks
- Cache staleness vs. freshness: chose TTL = 5 min with jitter; added explicit cache invalidation on writes for critical entities.
- Complexity: Parallel fan-out increased code complexity and potential race conditions; mitigated with deadlines, context propagation, and comprehensive tests.
- Retries and tail latency: Limited retries to prevent retry storms; used backoff + jitter; added rate limiting and load shedding during brownouts.
- Index maintenance: Monitored write amplification and table bloat; scheduled periodic vacuum/analyze.
Learnings
- Tackle the largest contributor first (Amdahl’s Law): the sequential fan-out + DB scan dominated tail latency.
- Tail metrics (p95/p99) and saturation are more actionable than averages.
- Budgeted timeouts and partial results can drastically improve tail behavior.
- Ship in small, observable steps with feature flags and canaries; protect with error-budget policies.
## How to structure your own story
- Context: “Service X, scale Y RPS, users Z; SLO/metric at risk.”
- Bottleneck: “We measured and found A (e.g., sequential I/O, missing index, GC pauses, TLS handshakes). Tools: tracing, profiling, load test.”
- Experiments: “We tested B behind flags; measured improvement C in staging/canary.”
- Changes: “Implemented D (caching, indexing, parallelism, pooling, GC tuning), with E (timeouts, circuit breakers) for safety.”
- Impact: “p99 from M to N; throughput from P to Q; error rate from R to S; secondary effects.”
- Trade-offs/Risks: “Staleness, cost, complexity; mitigations.”
- Learnings: “Principles you’ll reuse.”
## Small numeric intuition you can reuse
- With caching, expected latency is roughly E[L] ≈ hit_rate × L_hit + (1 − hit_rate) × L_miss. Example: 70% hits at 20 ms, 30% misses at 300 ms → E[L] ≈ 0.7×20 + 0.3×300 = 104 ms. This is a mean, but a higher hit rate often narrows the tail too, because fewer requests take the slow path.
- Little’s Law: L = λW. At λ = 6k RPS and p99 W ≈ 1.8 s, ≈10.8k requests can be in flight at the tail, which explains cascading timeouts. Both formulas are checked in the snippet below.
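A quick sanity check of both formulas in Go, using the values from the examples above:

```go
package main

import "fmt"

func main() {
	// Expected latency with caching: E[L] = h*Lhit + (1-h)*Lmiss.
	h, lHit, lMiss := 0.70, 20.0, 300.0 // hit rate, hit/miss latency in ms
	fmt.Printf("E[L] ≈ %.0f ms\n", h*lHit+(1-h)*lMiss) // 104 ms

	// Little’s Law: requests in flight L = λW.
	lambda, w := 6000.0, 1.8 // arrival rate (RPS), tail latency (s)
	fmt.Printf("in flight at tail ≈ %.0f requests\n", lambda*w) // 10800
}
```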
## Common pitfalls and guardrails
- Don’t optimize on averages; track p95/p99, saturation, and error budgets.
- Unbounded retries worsen tail latency; add backoff + jitter and an upper bound on attempts (see the sketch below).
- Parallelism without budgets can increase load on dependencies; use deadlines and bulkheads.
- Caches need observability (hit/miss, staleness), jittered TTLs, and safe fallbacks.
- Always canary and have automated rollback; verify with load tests that match real traffic distributions.
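As a concrete guardrail for the retry pitfall above, here is a sketch of a bounded retry with full jitter; the one-retry cap matches the policy in the example answer, while the 50 ms base delay is an assumed value:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

const (
	maxRetries = 1                     // matches the "max 1 retry" policy above
	baseDelay  = 50 * time.Millisecond // assumed base; tune to the dependency
)

// withRetry retries at most maxRetries times with full jitter: the wait is a
// random duration in [0, base << attempt), so synchronized clients do not
// retry in lockstep and amplify the original spike.
func withRetry(ctx context.Context, call func(context.Context) error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = call(ctx); err == nil {
			return nil
		}
		if attempt == maxRetries {
			break // bounded: give up instead of storming the dependency
		}
		backoff := baseDelay << attempt
		select {
		case <-time.After(time.Duration(rand.Int63n(int64(backoff)))):
		case <-ctx.Done():
			return ctx.Err() // respect the caller's deadline while waiting
		}
	}
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	attempts := 0
	err := withRetry(ctx, func(context.Context) error {
		attempts++
		if attempts == 1 {
			return errors.New("transient failure")
		}
		return nil
	})
	fmt.Println(attempts, err) // 2 <nil>: one retry, then success
}
```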