This ML System Design question evaluates understanding of GPU inference batching, request queuing and routing, scheduling and autoscaling, throughput–latency trade-offs, multi-model/version management, failure handling, and observability in machine learning serving systems.
Design a system that serves online model-inference requests on GPUs. Requests arrive one at a time from clients, but GPU throughput is much better when compatible requests are grouped into batches.
Discuss how you would design such a service. Your design should cover the API, the queueing model, the batching strategy, the scheduling policy, the worker lifecycle, autoscaling signals, and the main trade-offs.
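To make the core batching trade-off concrete, here is a minimal sketch of the dynamic-batching pattern a candidate might describe: a dispatcher thread collects requests until either a maximum batch size is reached or a per-batch wait budget expires, then runs one grouped GPU call. All names and parameters (`MAX_BATCH`, `MAX_WAIT_S`, `infer_batch`) are hypothetical, and the GPU forward pass is stubbed out; this illustrates the queueing and routing shape, not a production implementation.

```python
import queue
import threading
import time

MAX_BATCH = 8        # largest batch the model accepts (assumed)
MAX_WAIT_S = 0.005   # latency budget for filling a batch (assumed)

request_q = queue.Queue()  # holds (input, per-request reply queue) pairs

def infer_batch(inputs):
    # Stand-in for the real batched GPU forward pass.
    return [f"result:{x}" for x in inputs]

def batcher_loop():
    while True:
        first = request_q.get()  # block until at least one request arrives
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_S
        # Fill the batch until it is full or the wait budget expires:
        # this is the throughput (bigger batches) vs. latency (less
        # waiting) trade-off the question asks about.
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_q.get(timeout=remaining))
            except queue.Empty:
                break
        results = infer_batch([inp for inp, _ in batch])
        for (_, reply_q), out in zip(batch, results):
            reply_q.put(out)  # route each result back to its caller

def submit(x):
    # Client-facing call: enqueue one request, wait for its batched result.
    reply_q = queue.Queue(maxsize=1)
    request_q.put((x, reply_q))
    return reply_q.get()

threading.Thread(target=batcher_loop, daemon=True).start()
print(submit("a"))  # prints "result:a", served as part of a batch
```

A full answer would extend this single-process sketch with per-model queues, admission control under overload, timeouts and retries on worker failure, and queue-depth or GPU-utilization signals feeding the autoscaler.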