How do I approach ML System Design interview questions?

ML System Design questions require a solid grasp of core concepts and regular practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

What difficulty level is this interview question?

This is a hard-difficulty ML System Design question, commonly asked during onsite rounds at HubSpot.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at HubSpot during technical interviews.

Design a GPU inference API | HubSpot Interview Question

Quick Overview

This question evaluates a candidate's ability to design scalable, multi-tenant GPU-backed inference platforms, testing competencies in ML systems engineering, distributed systems, API design, resource and memory management, scheduling, autoscaling, observability, and multi-tenant isolation.

System Design: GPU-Backed Inference Platform and API

You are designing a production inference platform to serve deep learning models (vision, ranking, and LLMs) at scale. Traffic is bursty and multi-tenant. The GPU fleet is heterogeneous (e.g., T4, A10, A100/H100), and you must support multiple models per GPU while meeting latency SLOs.

Design the system and explain key trade-offs. Specifically cover:

API shape
- Synchronous vs asynchronous semantics
- Streaming responses
- Backwards-compatible schema and API versioning
Request schema and versioning
- Request/response schemas, idempotency, and model version pinning vs aliases
Request routing
- Regional routing, capacity-aware routing, and per-model queues
Batching strategy to maximize GPU utilization
- Micro/continuous batching, padding, admission control; handling LLM prefill vs decode
Model loading and cold-start mitigation
- Model registry, artifact formats, weight caching/warm pools
Multi-model hosting on shared GPUs
- Memory management, MIG/MPS, LoRA/adapters, isolation vs utilization trade-offs
Autoscaling across heterogeneous nodes
- Metrics to scale on, predictive vs reactive, bin-packing/placement
Placement strategy
- Scheduling by GPU type, memory, model affinity, anti-affinity for HA
Observability
- Latency (end-to-end and queueing), throughput, GPU metrics (utilization, memory, SM occupancy)
Rate limiting
- Per-tenant quotas, concurrency limits, fairness
Multi-tenant isolation
- Noisy-neighbor controls, security, priority tiers
Failure handling
- Retries, timeouts, idempotency, circuit breakers
- Canary and A/B rollouts, shadowing
High-level architecture
- Control plane vs data plane, main components and request flows
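To make the request-schema items above more concrete, here is a minimal sketch of how idempotency keys and "pinned version vs alias" resolution might interact at a gateway. Everything named here (`InferenceRequest`, `Gateway`, the `"stable"` alias, the in-memory dictionaries) is a hypothetical illustration and not part of the original question; a real design would likely back both maps with a model registry and a TTL-bounded store.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class InferenceRequest:
    tenant_id: str
    model: str                      # model name, e.g. "ranker" (hypothetical)
    version: str                    # pinned version ("3") or alias ("stable")
    idempotency_key: Optional[str]  # client-supplied; None disables dedup
    payload: bytes


class Gateway:
    """Toy gateway: resolves aliases and deduplicates retried requests."""

    def __init__(self, aliases: Dict[Tuple[str, str], str]) -> None:
        # (model, alias) -> concrete version; stands in for a model registry.
        self._aliases = aliases
        # (tenant_id, idempotency_key) -> cached response (assumed unbounded here).
        self._seen: Dict[Tuple[str, str], dict] = {}

    def resolve(self, model: str, version: str) -> str:
        # A pinned version passes through unchanged; an alias maps to
        # whatever concrete version the registry currently points at.
        return self._aliases.get((model, version), version)

    def handle(self, req: InferenceRequest) -> dict:
        key = (req.tenant_id, req.idempotency_key) if req.idempotency_key else None
        if key is not None and key in self._seen:
            # Retried request: return the original response, even if the
            # alias has since moved to a newer version.
            return self._seen[key]
        response = {
            "model": req.model,
            "resolved_version": self.resolve(req.model, req.version),
            "output_bytes": len(req.payload),  # stand-in for real inference
        }
        if key is not None:
            self._seen[key] = response
        return response
```

One design choice this sketch highlights: echoing `resolved_version` in the response lets clients that call through an alias see which concrete version served them, which helps when debugging canary or A/B rollouts.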

Assume typical production SLOs (e.g., p95 latency < 300 ms for small models; streaming for LLMs), payloads up to ~10 MB, and that you need to operate multi-region.
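These assumptions constrain batching and admission control directly: any time a request spends waiting for a batch to fill comes out of the latency budget, and oversized payloads should be rejected before they reach a queue. The sketch below illustrates one simple size-or-timeout micro-batching policy together with a payload check. The function names and the 10 ms wait used in the usage note are illustrative assumptions, not values from the question, and "10 MB" is interpreted here as 10 MiB.

```python
from typing import List, Sequence

# Payload cap taken from the stated assumption (~10 MB), read as 10 MiB here.
MAX_PAYLOAD_BYTES = 10 * 1024 * 1024


def admit(payload_size: int) -> bool:
    """Reject empty or oversized payloads before they are queued."""
    return 0 < payload_size <= MAX_PAYLOAD_BYTES


def plan_batches(
    arrivals_ms: Sequence[float],
    max_batch: int,
    max_wait_ms: float,
) -> List[List[int]]:
    """Group request indices into micro-batches.

    A batch is closed when it already holds `max_batch` requests, or when
    the next request arrives more than `max_wait_ms` after the batch's
    first request. `arrivals_ms` must be sorted in ascending order.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    for i, t in enumerate(arrivals_ms):
        if current:
            batch_full = len(current) >= max_batch
            waited_too_long = t - arrivals_ms[current[0]] > max_wait_ms
            if batch_full or waited_too_long:
                batches.append(current)
                current = []
        current.append(i)
    if current:
        batches.append(current)
    return batches
```

For example, `plan_batches([0, 2, 4, 30, 31], max_batch=2, max_wait_ms=10)` groups the five requests as `[[0, 1], [2], [3, 4]]`. The first batch closes because it is full, and the second closes because the next arrival falls outside the wait window. In a design answer, `max_wait_ms` would typically be a small fraction of the 300 ms p95 target so that queueing does not dominate end-to-end latency.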

Provide a concise, well-structured design with diagrams described in words and call out the most important trade-offs.

