System Design: GPU-Backed Inference Platform and API
You are designing a production inference platform to serve deep learning models (vision, ranking, and LLMs) at scale. Traffic is bursty and multi-tenant. The GPU fleet is heterogeneous (e.g., T4, A10, A100/H100), and you must support multiple models per GPU while meeting latency SLOs.
Design the system and explain key trade-offs. Specifically cover:
- API shape
  - Synchronous vs asynchronous semantics
  - Streaming responses
  - Backwards-compatible schema and API versioning
- Request schema and versioning (schema sketch below)
  - Request/response schemas, idempotency, and model version pinning vs aliases
- Request routing
  - Regional routing, capacity-aware routing, and per-model queues
- Batching strategy to maximize GPU utilization (batching sketch below)
  - Micro/continuous batching, padding, admission control; handling LLM prefill vs decode
- Model loading and cold-start mitigation (weight-cache sketch below)
  - Model registry, artifact formats, weight caching/warm pools
- Multi-model hosting on shared GPUs
  - Memory management, MIG/MPS, LoRA/adapters, isolation vs utilization trade-offs
- Autoscaling across heterogeneous nodes
  - Metrics to scale on, predictive vs reactive, bin-packing/placement
- Placement strategy (placement sketch below)
  - Scheduling by GPU type, memory, model affinity, anti-affinity for HA
- Observability
  - Latency (end-to-end and queueing), throughput, GPU metrics (utilization, memory, SM occupancy)
- Rate limiting (token-bucket sketch below)
  - Per-tenant quotas, concurrency limits, fairness
- Multi-tenant isolation
  - Noisy-neighbor controls, security, priority tiers
- Failure handling
  - Retries, timeouts, idempotency, circuit breakers
  - Canary and A/B rollouts, shadowing
- High-level architecture
  - Control plane vs data plane, main components and request flows
Assume typical production SLOs (e.g., p95 latency < 300 ms for small models; streamed responses for LLMs), payloads up to ~10 MB, and that the platform must operate across multiple regions.
Provide a concise, well-structured design with diagrams described in words and call out the most important trade-offs.
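
To make a few of the requested items concrete, the sketches below are illustrative only: every name, field, constant, and weight in them is an assumption for discussion, not a prescribed implementation. First, one possible request/response envelope covering sync vs async semantics, streaming, idempotency, and model version pinning vs aliases:

```python
from dataclasses import dataclass, field
from typing import Literal, Optional
import uuid

# Illustrative request/response envelope for a JSON-over-HTTP inference API.
# Field names are assumptions for discussion, not a prescribed schema.

@dataclass
class InferenceRequest:
    model: str                                   # e.g. "ranker-large"
    # Pin an immutable version for reproducibility, or use a mutable alias
    # ("prod", "canary") that the control plane resolves at admission time.
    model_version: Optional[str] = None          # e.g. "v37"
    model_alias: Optional[str] = None            # e.g. "prod"
    inputs: dict = field(default_factory=dict)   # model-specific payload, <= ~10 MB
    mode: Literal["sync", "async"] = "sync"      # async returns a job handle instead
    stream: bool = False                         # chunked / server-sent-event output
    # Client-supplied key so retried submissions of the same logical call dedupe.
    idempotency_key: str = field(default_factory=lambda: str(uuid.uuid4()))
    timeout_ms: int = 300
    priority: Literal["batch", "standard", "interactive"] = "standard"

@dataclass
class InferenceResponse:
    request_id: str
    model: str
    resolved_version: str        # concrete version the alias resolved to
    outputs: Optional[dict]      # present for sync calls
    job_id: Optional[str]        # present for async calls; poll or subscribe
    queue_ms: float              # time spent queued, for SLO attribution
    compute_ms: float
```

Returning the concretely resolved version makes alias-based routing debuggable, and the client-supplied idempotency key lets the gateway deduplicate retried submissions.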
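
For the batching item, a minimal micro-batching loop for a single model replica, assuming an asyncio service; `run_model`, `MAX_BATCH_SIZE`, and `MAX_WAIT_MS` are placeholders. Continuous batching for LLM decode (interleaving prefill and decode of different requests by re-forming the batch at every decode step) follows the same pattern but operates per step rather than per request:

```python
import asyncio
import time

# Minimal micro-batching loop for one model replica: flush a batch when it is
# full, or MAX_WAIT_MS after the first queued request is picked up. run_model
# stands in for the padded GPU forward pass and must return one output per input.

MAX_BATCH_SIZE = 16
MAX_WAIT_MS = 5.0

async def micro_batcher(queue: asyncio.Queue, run_model) -> None:
    while True:
        first = await queue.get()                 # block until there is work
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        inputs = [req for req, _ in batch]
        outputs = await run_model(inputs)         # one batched forward pass
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)                   # wake the waiting request handler

async def submit(queue: asyncio.Queue, request) -> object:
    # Admission control (queue depth limits, per-tenant quotas) would reject here.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((request, fut))
    return await fut
```

The core trade-off is latency vs utilization: a larger `MAX_WAIT_MS` yields fuller batches and better GPU utilization at the cost of queueing latency, which is why queue time should be tracked separately under observability.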
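
For cold-start mitigation, a toy per-GPU weight cache with LRU eviction under a memory budget; `load_weights` and `unload` stand in for real engine calls (reading artifacts fetched from the model registry into GPU memory, freeing buffers), and the policy itself is an assumption:

```python
from collections import OrderedDict

# Toy per-GPU weight cache: keep loaded models in LRU order and evict cold ones
# when a new load would exceed the memory budget. load_weights/unload stand in
# for real engine calls (fetching registry artifacts, freeing GPU buffers).

class WeightCache:
    def __init__(self, budget_gb: float, load_weights, unload):
        self.budget_gb = budget_gb
        self.load_weights = load_weights
        self.unload = unload
        self.loaded = OrderedDict()    # model_id -> size_gb, oldest first

    def acquire(self, model_id: str, size_gb: float) -> None:
        if model_id in self.loaded:
            self.loaded.move_to_end(model_id)      # cache hit: refresh recency
            return
        while self.loaded and sum(self.loaded.values()) + size_gb > self.budget_gb:
            victim, _ = self.loaded.popitem(last=False)
            self.unload(victim)                    # evict least recently used
        self.load_weights(model_id)                # cold load: registry -> disk -> GPU
        self.loaded[model_id] = size_gb
```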
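
For placement, a toy scorer that applies hard constraints (free memory, allowed GPU type) and then trades off warm-weight affinity, node anti-affinity for HA, and best-fit packing; the weights are arbitrary and only illustrate the shape of the decision:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Toy placement scorer: filter GPUs on hard constraints (free memory, allowed
# type), then prefer warm-weight affinity, node anti-affinity for HA, and
# best-fit packing. The weights are arbitrary and purely illustrative.

@dataclass
class Gpu:
    node: str
    gpu_type: str                       # "T4", "A10", "A100", ...
    free_mem_gb: float
    hosted_models: set = field(default_factory=set)

@dataclass
class ModelSpec:
    name: str
    mem_gb: float
    allowed_gpu_types: set
    replica_nodes: set                  # nodes already running a replica

def score(gpu: Gpu, model: ModelSpec) -> Optional[float]:
    if model.mem_gb > gpu.free_mem_gb or gpu.gpu_type not in model.allowed_gpu_types:
        return None                                           # hard constraints fail
    s = 10.0 if model.name in gpu.hosted_models else 0.0      # warm weights / affinity
    s -= 5.0 if gpu.node in model.replica_nodes else 0.0      # anti-affinity for HA
    s += 3.0 * (model.mem_gb / gpu.free_mem_gb)               # best-fit packing
    return s

def place(gpus: List[Gpu], model: ModelSpec) -> Optional[Gpu]:
    scored = [(score(g, model), g) for g in gpus]
    scored = [(s, g) for s, g in scored if s is not None]
    return max(scored, key=lambda sg: sg[0])[1] if scored else None
```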
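
For rate limiting, a per-tenant token bucket that caps sustained request rate while absorbing bursts; the tenant IDs and limits are examples, and per-tenant concurrency caps plus fair queueing between tenants would sit behind this first admission check:

```python
import time

# Per-tenant token bucket: `rate` tokens per second refill up to `burst`. A
# request is admitted only if enough tokens remain, capping sustained QPS per
# tenant while absorbing short bursts. Tenants and limits are illustrative.

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per (tenant, model); hypothetical limits for two priority tiers.
buckets = {
    "tenant-a": TokenBucket(rate=50.0, burst=100.0),
    "tenant-b": TokenBucket(rate=5.0, burst=20.0),
}

def admit(tenant: str, cost: float = 1.0) -> bool:
    bucket = buckets.get(tenant)
    return bucket.allow(cost) if bucket else False
```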