This question evaluates a candidate's understanding of GPU-backed inference platforms, distributed systems and scheduling, model lifecycle and GPU memory management, autoscaling across heterogeneous nodes, multi-tenant isolation, and operational concerns such as observability, reliability, cost controls, and security.

Design a production-grade inference platform for serving multiple ML models (e.g., LLMs, vision models, and classic DL models) backed by GPUs. The platform must meet strict latency SLOs for online traffic while achieving high throughput via dynamic batching. It should support model versioning with A/B routing, autoscale across heterogeneous GPU nodes, provide isolation and quotas for multiple tenants, and remain fault-tolerant.
Assume global deployment in at least two regions, gRPC/HTTP-based clients, and a mix of streaming and unary requests.
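To make the latency-versus-throughput trade-off concrete, here is a minimal sketch of the dynamic batching idea the prompt refers to: requests are queued and flushed to the model either when a batch fills up or when a small wait budget expires, so tail latency stays bounded while the GPU sees larger batches. The `DynamicBatcher` class, the `run_model` callback, and the `max_batch_size`/`max_wait_ms` parameters are illustrative assumptions, not part of the question; a real answer would place this behind the gRPC/HTTP frontends and the per-tenant routing layer.

```python
# Illustrative dynamic batching sketch (asyncio-based); names and defaults are hypothetical.
import asyncio
from dataclasses import dataclass, field


@dataclass
class PendingRequest:
    payload: dict
    future: asyncio.Future = field(default_factory=asyncio.Future)


class DynamicBatcher:
    """Collects requests until max_batch_size is reached or max_wait_ms elapses,
    then runs one batched inference call and fans results back to the callers."""

    def __init__(self, run_model, max_batch_size=8, max_wait_ms=5):
        self.run_model = run_model          # callable: list of payloads -> list of results
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000.0
        self.queue = asyncio.Queue()

    async def submit(self, payload: dict):
        req = PendingRequest(payload)
        await self.queue.put(req)
        return await req.future             # resolved by the batching loop below

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]            # block until the first request arrives
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            results = self.run_model([r.payload for r in batch])
            for req, result in zip(batch, results):
                req.future.set_result(result)


async def main():
    # Stand-in model: echoes inputs; a real deployment would invoke the GPU runtime here.
    batcher = DynamicBatcher(run_model=lambda batch: [{"echo": p} for p in batch])
    loop_task = asyncio.create_task(batcher.run())
    replies = await asyncio.gather(*(batcher.submit({"i": i}) for i in range(20)))
    print(replies[:3])
    loop_task.cancel()


if __name__ == "__main__":
    asyncio.run(main())
```

A strong answer would extend this idea with per-model queues, SLO-aware wait budgets, and backpressure per tenant, but the core trade-off (batch size versus added queueing delay) is what the sketch is meant to surface.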