This question evaluates competency in diagnosing and mitigating production overloads in Kubernetes-based microservice architectures, emphasizing observability, autoscaling behavior, resource constraints, and incident-response reasoning.
You are given a Kubernetes-based microservices system that is currently overloaded, exhibiting high tail latency (p95/p99), request timeouts, and autoscaling thrash (rapid scale up/down). Design an end-to-end approach to debug and mitigate the incident.
Provide:
Assume a typical setup: Kubernetes with a Horizontal Pod Autoscaler (HPA), a metrics stack (e.g., Prometheus/Grafana), centralized logging, and distributed tracing (e.g., OpenTelemetry with Jaeger or Tempo). If a component is missing, state a pragmatic alternative.
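As background for reasoning about the autoscaling thrash, here is a minimal Python sketch of the replica calculation described in the Kubernetes HPA documentation: the desired count is the current count scaled by the ratio of observed to target metric, rounded up, with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The function names and the sample numbers are illustrative assumptions, not part of the question, and the real controller also accounts for pod readiness, missing metrics, and min/max replica bounds.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Approximate the documented HPA formula:
    desired = ceil(current_replicas * current_metric / target_metric),
    skipped when the ratio is within `tolerance` of 1.0.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling action
    return math.ceil(current_replicas * ratio)


def stabilized_scale_down(recent_recommendations):
    """Scale-down stabilization: act on the highest recommendation seen
    within the stabilization window, which damps rapid scale-down."""
    return max(recent_recommendations)


if __name__ == "__main__":
    # Hypothetical numbers: 4 pods at 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
    # After scaling out, per-pod load falls well below target.
    print(desired_replicas(6, 30, 60))
    # With a stabilization window, the drop to the lower count is deferred.
    print(stabilized_scale_down([6, 3, 3]))
```

A metric that swings between far above and far below target on successive evaluations yields alternating scale-up and scale-down recommendations, which is one plausible mechanism behind the thrash described in the prompt; a longer scale-down stabilization window is one of the levers a candidate might discuss.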