Reliability, Performance, and Infrastructure Operations

What's being tested

Interviewers are probing whether you can operate production services under real traffic: define service-level indicators, diagnose latency or failure spikes, design overload protection, and make tradeoffs between reliability, cost, and feature completeness. For TikTok-scale systems, a small inefficiency or missing guardrail can cascade across millions of requests, so a Software Engineer is expected to reason beyond “my code works” into capacity, observability, dependencies, and graceful degradation. Strong answers show a loop: measure with the right signals, form hypotheses, isolate bottlenecks, mitigate safely, and verify impact. You should be ready to discuss concrete mechanisms like `p99` latency, `Kubernetes` pod failures, `Redis` eviction policies, circuit breakers, backpressure, and SLO-based prioritization.

Core knowledge

SLI, SLO, and SLA are distinct: an SLI is a measured signal such as `availability`, `error_rate`, or `p99_latency`; an SLO is the internal target for that signal, such as `99.9%` of requests succeeding within `300ms`; an SLA is the external contract, with penalties if it is missed.
Latency percentiles matter more than averages for user-facing systems. Track `p50`, `p95`, `p99`, and sometimes `p999`; mean latency can hide tail spikes caused by lock contention, GC pauses, noisy neighbors, network retries, or overloaded downstream services.
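Little’s Law helps reason about queues: $L = \lambda W$, where $L$ is the average amount of in-flight work, $\lambda$ is the arrival rate, and $W$ is the average time in system. For example, at 1,000 requests per second and 50 ms average time in system, about 50 requests are in flight. If the arrival rate exceeds service capacity, the queue grows without bound and tail latency explodes.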
Observability should include metrics, logs, and traces. Use RED metrics for request services: Rate, Errors, Duration. Use USE metrics for resources: Utilization, Saturation, Errors across CPU, memory, disk, network, thread pools, and connection pools.
Performance diagnosis should move layer by layer: client latency, edge/load balancer, application handler, downstream RPCs, database queries, cache hit rate, host resources, and deployment changes. A strong answer avoids guessing and correlates symptoms with deploys, traffic shape, dependency health, and saturation.
Load shedding protects the system by rejecting or degrading work before collapse. Common tools include token-bucket rate limiting, bounded queues, priority classes, request deadlines, adaptive concurrency limits, circuit breakers, and returning `429`, `503`, cached, or partial responses.
Backpressure pushes overload signals upstream instead of silently accumulating work. Examples include bounded worker queues, non-blocking rejection, `gRPC` deadlines, client retry budgets, and avoiding infinite retries that amplify load during an incident.
Circuit breakers isolate failing dependencies. A typical state machine has closed, open, and half-open states: the breaker opens once error-rate or timeout thresholds are exceeded, admits a few probe requests in half-open after a cooldown, and closes only after sustained recovery. Pair breakers with fallbacks and timeouts.
Kubernetes failure debugging starts with pod state and events. Look for statuses such as `CrashLoopBackOff`, `OOMKilled`, and `ImagePullBackOff`, then for causes such as readiness probe failures, CPU throttling, node pressure, misconfigured resource requests/limits, bad ConfigMaps, or dependency startup ordering. Check `kubectl describe pod`, logs, events, and rollout history.
Capacity planning should estimate peak QPS, per-request CPU/memory, dependency limits, and headroom. If one instance handles `500` QPS at target `p99`, and peak is `20k` QPS, you need at least `40` instances before redundancy, zone failure tolerance, and deploy surge capacity.
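Redis tradeoffs include persistence, memory policy, replication, and clustering. `RDB` snapshots are compact but can lose writes made since the last snapshot; `AOF` improves durability by logging each write, at the cost of extra disk I/O and larger files. Eviction policies such as `allkeys-lru`, `volatile-ttl`, or `noeviction` must match the workload: evicting keys is usually fine for a cache, whereas `noeviction` rejects writes once memory is full, which suits data that must not be silently dropped.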
Caching failure modes include stampedes, hot keys, stale data, and cache/database inconsistency. Use TTL jitter, request coalescing, negative caching, hot-key sharding, single-flight locks, and clear ownership of whether `Redis` is a cache, primary store, queue, or coordination primitive.
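
The circuit-breaker state machine described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a production library: the class name `CircuitBreaker`, the helper `call_with_breaker`, the consecutive-failure trigger, and all threshold values are assumptions chosen for the example. A real breaker would more likely track an error rate over a sliding window and be thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Consecutive-failure breaker; thresholds are illustrative assumptions."""

    def __init__(self, failure_threshold=3, cooldown_s=5.0,
                 successes_to_close=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.successes_to_close = successes_to_close
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow(self):
        """Return True if a call may be attempted right now."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.cooldown_s:
                # Cooldown elapsed: let probe requests through.
                self.state = State.HALF_OPEN
                self.successes = 0
                return True
            return False
        return True

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.successes_to_close:
                # Sustained recovery: resume normal traffic.
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # a failed probe reopens immediately
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0


def call_with_breaker(breaker, fn, fallback):
    """Call fn through the breaker; use fallback when open or on failure."""
    if not breaker.allow():
        return fallback()
    try:
        result = fn()
    except Exception:
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result
```

While the breaker is open, `fn` is never invoked, which is what protects the failing dependency from additional load. The fallback here stands in for a cached or partial response.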

Worked example

For “Design overload protection with load shedding”, start by clarifying the service shape: is it a stateless HTTP API, what is peak QPS, which requests are user-critical, what downstream dependencies exist, and what SLO must be preserved under overload? Then declare an assumption such as: “I’ll design for a user-facing read-heavy service where protecting `p99` latency and availability is more important than serving every low-priority request.” A strong answer can be organized around four pillars: admission control at the edge, bounded work inside the service, dependency isolation, and graceful degradation.

At the edge, propose per-user or per-token rate limits using token buckets, plus global adaptive limits when fleet saturation crosses thresholds. Inside the service, use bounded queues, request deadlines, worker-pool limits, and fast rejection instead of letting memory or threads grow until the process dies. For dependencies, add circuit breakers, bulkheads, timeout budgets, and fallback paths such as cached responses or reduced payloads. Explicitly flag the tradeoff: aggressive shedding protects the majority of users and keeps recovery fast, but it can reject legitimate burst traffic, so limits should be observable, configurable, and tested under load. Close by saying that with more time you would add chaos/load tests, retry-budget enforcement on clients, and dashboards showing accepted QPS, shed QPS, saturation, and user-visible error rates.
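
The first two pillars, admission control and bounded work, can be sketched as follows. This is a simplified illustration under stated assumptions: `TokenBucket` and `AdmissionController` are hypothetical names, a single bucket stands in for per-user buckets, the in-flight counter stands in for a bounded queue or worker pool, and the code is not thread-safe. The `429` and `503` return values mirror the status codes a shedding service might send.

```python
import time


class TokenBucket:
    """Refills at rate_per_s tokens per second, up to burst tokens."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock  # injectable so tests can control time
        self.last = clock()

    def try_acquire(self, cost=1.0):
        now = self.clock()
        # Add tokens earned since the last call, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class AdmissionController:
    """Rejects work early instead of queueing it without bound."""

    def __init__(self, bucket, max_in_flight, low_priority_fraction=0.8):
        self.bucket = bucket
        self.max_in_flight = max_in_flight
        # Low-priority requests are shed first, leaving headroom for critical ones.
        self.low_priority_limit = int(max_in_flight * low_priority_fraction)
        self.in_flight = 0
        self.shed = 0  # export as a metric so shedding is observable

    def try_admit(self, critical=False):
        """Return 200 if admitted, 503 if saturated, 429 if rate limited."""
        limit = self.max_in_flight if critical else self.low_priority_limit
        if self.in_flight >= limit:
            self.shed += 1
            return 503
        if not self.bucket.try_acquire():
            self.shed += 1
            return 429
        self.in_flight += 1
        return 200

    def release(self):
        """Call when an admitted request finishes."""
        self.in_flight -= 1
```

Checking saturation before spending a token means a rejected request does not consume rate-limit budget. Whether that ordering is right depends on whether the limit is meant to protect the service or to enforce fairness between callers.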

A second angle

For “Explain Redis design, persistence, and scaling”, the same reliability thinking applies but the bottleneck is now stateful infrastructure rather than request admission. You still need to ask whether `Redis` is used as a cache, session store, rate limiter, lock service, or primary-ish data store, because each use case changes durability and consistency expectations. For a cache, `allkeys-lru`, TTL jitter, and read-through fallback may be fine; for rate limiting, atomicity matters more: issuing `INCR` and then setting an expiry as two separate commands can leave a counter without a TTL if the client fails in between, so a Lua script or transaction is the safer choice. Scaling requires discussing memory limits, hot keys, replication lag, failover behavior, and cluster slot distribution. The key transfer is that reliability is not “make `Redis` fast”; it is choosing failure behavior intentionally when memory fills, a node dies, or traffic concentrates on one key.
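
To make the rate-limiting case concrete, here is a sketch of a fixed-window counter in which the increment and the expiry happen as one step. It deliberately does not use a Redis client: `FakeCounterStore` and `incr_with_ttl` are hypothetical in-memory stand-ins that only imitate the semantics an atomic increment-plus-expiry would provide, so the logic can run anywhere.

```python
import time


class FakeCounterStore:
    """In-memory stand-in for a key-value store with counters and TTLs.

    Hypothetical helper for illustration only; it is not a Redis client.
    """

    def __init__(self, clock=time.monotonic):
        self.clock = clock  # injectable so tests can control time
        self.data = {}      # key -> (count, expires_at)

    def incr_with_ttl(self, key, ttl_s):
        """Increment key, starting a fresh TTL window if the key is new or expired."""
        now = self.clock()
        count, expires_at = self.data.get(key, (0, 0.0))
        if count == 0 or now >= expires_at:
            # New or expired key: reset the counter and start a new window.
            count, expires_at = 0, now + ttl_s
        count += 1
        self.data[key] = (count, expires_at)
        return count


def allow_request(store, user_id, limit, window_s):
    """Fixed-window limiter: at most `limit` requests per user per window."""
    count = store.incr_with_ttl(f"rl:{user_id}", window_s)
    return count <= limit
```

A fixed window permits up to twice the limit across a window boundary; sliding-window or token-bucket variants trade extra state for smoother enforcement.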

Common pitfalls

Pitfall: Treating reliability as just “add more machines.”

Horizontal scaling helps only if the bottleneck is stateless compute. If the real issue is a saturated database connection pool, hot `Redis` key, synchronized retries, slow external RPC, or unbounded queue, adding pods may amplify pressure downstream and worsen tail latency.

Pitfall: Jumping to fixes before defining symptoms and success metrics.

A weak answer says “I would cache it” or “I would optimize the query” without naming the SLI being improved. A better answer says: “The regression is `p99` latency from `220ms` to `900ms`, correlated with a deploy and increased DB time; I’ll roll back or mitigate first, then profile and validate against the SLO.”
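
The kind of check behind that better answer can be sketched as follows. The latency samples are synthetic and chosen to match the `220ms` and `900ms` figures above, the `300ms` SLO threshold is an assumption, and `percentile`, `summarize`, and `breaches_slo` are hypothetical helper names using the nearest-rank method.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Return the mean alongside median and tail latency for comparison."""
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p99": percentile(samples_ms, 99),
    }


def breaches_slo(samples_ms, slo_p99_ms):
    """True when p99 latency exceeds the SLO threshold."""
    return percentile(samples_ms, 99) > slo_p99_ms


# Synthetic data: 98% of requests take 100 ms; the slowest 2% move from 220 ms to 900 ms.
before = [100] * 980 + [220] * 20
after = [100] * 980 + [900] * 20

if __name__ == "__main__":
    print("before:", summarize(before), "breach:", breaches_slo(before, 300))
    print("after: ", summarize(after), "breach:", breaches_slo(after, 300))
```

On this data the mean rises only from 102.4 ms to 116 ms and `p50` does not move at all, while `p99` jumps from 220 ms to 900 ms and crosses the assumed SLO. That is why the symptom should be stated as a percentile rather than an average.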

Pitfall: Staying too shallow on operational mechanisms.

Saying “use monitoring, rate limiting, and `Kubernetes`” is not enough. Interviewers expect you to explain what you would monitor, where you would enforce limits, what happens to rejected requests, how readiness/liveness probes differ, and how the system recovers without causing retry storms.
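
As one example of explaining a mechanism rather than naming it, client-side retry-storm protection can be sketched with jittered exponential backoff plus a retry budget. The names `RetryBudget`, `backoff_delay`, and `call_with_retries`, the 10% default ratio, and the delay constants are assumptions for illustration. A production budget would typically use a sliding time window rather than the cumulative counters shown here.

```python
import random
import time


class RetryBudget:
    """Permit retries only up to a fraction of first attempts (cumulative, simplified)."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        """Consume one retry if the budget allows it."""
        if self.retries < self.requests * self.ratio:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt, base_s=0.1, cap_s=5.0):
    """Full-jitter backoff: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


def call_with_retries(fn, budget, max_attempts=3, sleep=time.sleep):
    """Call fn, retrying on failure only while attempts and budget remain."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            is_last = attempt == max_attempts - 1
            if is_last or not budget.try_spend():
                raise  # fail fast instead of adding load during an incident
            sleep(backoff_delay(attempt))
```

Jitter spreads retries out so clients do not retry in lockstep, and the budget caps total retry traffic at a fixed fraction of normal traffic even when every call is failing.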

Connections

Interviewers may pivot from here into distributed systems consistency, database indexing and query optimization, microservice API design, or incident response and postmortems. They may also connect overload protection to rate limiter design, `Redis` atomic operations, `Kafka` consumer lag, or `Kubernetes` autoscaling behavior.
