Design a GPU inference API

Q: Design a GPU inference API

This question evaluates a candidate's understanding of GPU-backed inference platforms, distributed systems and scheduling, model lifecycle and GPU memory management, autoscaling across heterogeneous nodes, multi-tenant isolation, and operational concerns such as observability, reliability, cost controls, and security.

Q: How do I approach ML System Design interview questions?

ML System Design questions require an understanding of core concepts, plus practice. PracHub provides solutions with explanations to help you master ML system design interviews.

Question

System Design: GPU-Backed Multi-Model Inference API

Context

Design a production-grade inference platform for serving multiple ML models (e.g., LLMs, vision, and classic DL models) backed by GPUs. The platform must meet strict latency SLOs for online traffic while achieving high throughput via dynamic batching. It should support model versioning with A/B routing, autoscale across heterogeneous GPU nodes, provide isolation and quotas for multiple tenants, and remain fault-tolerant.

Assume global deployment in at least two regions, gRPC/HTTP-based clients, and a mix of streaming and unary requests.
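
To make the batching/latency tension in the context concrete, here is a minimal sketch of one common dynamic-batching policy: close a batch when it is full, or when the next request would have to wait past the oldest member's wait budget. The names `Request`, `form_batches`, `max_batch_size`, and `max_wait_ms` are hypothetical and not part of the question. A real batcher would run on a live queue with timers rather than over a pre-collected list of arrival times.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: str
    arrival_ms: float  # arrival time in milliseconds (illustrative clock)


def form_batches(
    requests: List[Request], max_batch_size: int, max_wait_ms: float
) -> List[List[str]]:
    """Group requests into batches in arrival order.

    A batch is closed when it is full, or when the next request arrives
    more than max_wait_ms after the oldest request in the batch.
    """
    batches: List[List[str]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        batch_full = len(current) >= max_batch_size
        budget_spent = bool(current) and (
            req.arrival_ms - current[0].arrival_ms > max_wait_ms
        )
        if current and (batch_full or budget_spent):
            batches.append([r.request_id for r in current])
            current = []
        current.append(req)
    if current:
        batches.append([r.request_id for r in current])
    return batches


example = [
    Request("a", 0.0),
    Request("b", 2.0),
    Request("c", 3.0),
    Request("d", 20.0),
    Request("e", 21.0),
]
batches = form_batches(example, max_batch_size=2, max_wait_ms=5.0)
```

With these inputs, `a` and `b` fill a batch, `c` is flushed alone because `d` arrives after its wait budget has expired, and `d` and `e` share the final batch. In a design discussion, `max_wait_ms` is the knob to derive from the latency SLO, while `max_batch_size` is bounded by GPU memory and the model's throughput curve.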

Requirements

SLOs and Latency: Low-latency online inference with clear SLOs (e.g., P95 time-to-first-byte/token and completion). High availability.
Throughput: Dynamic batching while respecting SLOs.
Models: Multiple models, versioning, A/B testing and routing.
Autoscaling: Across heterogeneous GPU node types (e.g., A10, A100, H100), with pre-warming and scale-to-zero support.
Queueing and Backpressure: Fairness and SLO-aware admission control.
Multi-tenant Isolation: Per-tenant quotas, budgets, and safety limits.
Lifecycle: Model loading, warmup, cache priming.
GPU Memory Management: Weights residency, KV/cache planning, and eviction.
Fault Tolerance: Retries, draining, canarying, and rollbacks.
Architecture: Describe API gateway, scheduler/router, batching layer, workers, model registry, and control plane.
Protocols: Streaming vs unary RPC.
Observability: Metrics, traces, logs.
Cost Controls: Efficiency, scaling policies, and spend guardrails.
Security: AuthN/Z, isolation, artifact integrity, and data protection.
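
The per-tenant quota and admission-control requirements above can be illustrated with a token bucket, one standard way to enforce a sustained rate with a bounded burst. This is a sketch under assumptions, not a prescribed answer: `TokenBucket`, `try_admit`, and the parameter names are hypothetical, and time is passed in explicitly so that the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate_per_s: float  # tokens refilled per second (sustained rate)
    capacity: float    # maximum tokens held (burst size)
    tokens: float      # current token balance
    last_s: float = 0.0  # time of the last refill, in seconds

    def try_admit(self, now_s: float, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then admit if enough tokens remain."""
        elapsed = max(0.0, now_s - self.last_s)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last_s = max(self.last_s, now_s)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per tenant: 1 request/s sustained, bursts of up to 2.
tenant_bucket = TokenBucket(rate_per_s=1.0, capacity=2.0, tokens=2.0)
decisions = [
    tenant_bucket.try_admit(0.0),  # burst token 1
    tenant_bucket.try_admit(0.0),  # burst token 2
    tenant_bucket.try_admit(0.0),  # bucket empty, rejected
    tenant_bucket.try_admit(1.0),  # one token refilled after 1 s
    tenant_bucket.try_admit(1.5),  # only 0.5 tokens available, rejected
]
```

The `cost` argument lets a design charge by estimated work (for example, requested output tokens for an LLM) rather than by request count, which tends to matter when request sizes vary widely across tenants.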
