PracHub

Design a GPU inference API

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's ability to design scalable, multi-tenant GPU-backed inference platforms, testing competencies in ML systems engineering, distributed systems, API design, resource and memory management, scheduling, autoscaling, observability, and multi-tenant isolation.



Company: HubSpot

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Design a GPU-backed inference API for serving deep learning models. Describe the API (sync/async), request schema and versioning, request routing, batching strategy to maximize GPU utilization, model loading and cold-start mitigation, and multi-model hosting on shared GPUs. Explain autoscaling across heterogeneous nodes, placement strategy, observability (latency, throughput, GPU metrics), rate limiting, multi-tenant isolation, and failure handling (retries, timeouts, canary/AB rollouts). Provide a high-level architecture and key trade-offs.


Sep 6, 2025, 12:00 AM

System Design: GPU-Backed Inference Platform and API

You are designing a production inference platform to serve deep learning models (vision, ranking, and LLMs) at scale. Traffic is bursty and multi-tenant. The GPU fleet is heterogeneous (e.g., T4, A10, A100/H100), and you must support multiple models per GPU while meeting latency SLOs.

Design the system and explain key trade-offs. Specifically cover:

  1. API shape
    • Synchronous vs asynchronous semantics
    • Streaming responses
    • Backwards-compatible schema and API versioning
  2. Request schema and versioning
    • Request/response schemas, idempotency, and model version pinning vs aliases
  3. Request routing
    • Regional routing, capacity-aware routing, and per-model queues
  4. Batching strategy to maximize GPU utilization
    • Micro/continuous batching, padding, admission control; handling LLM prefill vs decode
  5. Model loading and cold-start mitigation
    • Model registry, artifact formats, weight caching/warm pools
  6. Multi-model hosting on shared GPUs
    • Memory management, MIG/MPS, LoRA/adapters, isolation vs utilization trade-offs
  7. Autoscaling across heterogeneous nodes
    • Metrics to scale on, predictive vs reactive, bin-packing/placement
  8. Placement strategy
    • Scheduling by GPU type, memory, model affinity, anti-affinity for HA
  9. Observability
    • Latency (end-to-end and queueing), throughput, GPU metrics (utilization, memory, SM occupancy)
  10. Rate limiting
    • Per-tenant quotas, concurrency limits, fairness
  11. Multi-tenant isolation
    • Noisy-neighbor controls, security, priority tiers
  12. Failure handling
    • Retries, timeouts, idempotency, circuit breakers
    • Canary and A/B rollouts, shadowing
  13. High-level architecture
    • Control plane vs data plane, main components and request flows
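
As a starting point for items 1 and 2 in the list above, here is a minimal sketch of a request schema with version aliases and idempotency keys. Everything named here is an assumption made for illustration: the field names, the `ALIASES` table, the `ranker` model, and the in-memory `IdempotencyCache` (a real deployment would more likely use a shared store with expiry).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical alias table: model -> alias -> pinned, immutable version.
ALIASES: Dict[str, Dict[str, str]] = {
    "ranker": {"prod": "v12", "canary": "v13"},
}


@dataclass(frozen=True)
class InferenceRequest:
    tenant_id: str
    model: str
    version: str                      # pinned version ("v12") or alias ("prod")
    inputs: Dict[str, Any]
    idempotency_key: Optional[str] = None
    schema_version: str = "1"         # lets the server evolve the schema compatibly


def resolve_version(model: str, version: str) -> str:
    """Map an alias to its pinned version; pinned versions pass through unchanged."""
    return ALIASES.get(model, {}).get(version, version)


class IdempotencyCache:
    """In-memory stand-in for a shared store keyed by (tenant, idempotency key)."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, str], Any] = {}

    def run_once(self, req: InferenceRequest,
                 handler: Callable[[InferenceRequest], Any]) -> Any:
        # Requests without a key are always executed.
        if req.idempotency_key is None:
            return handler(req)
        key = (req.tenant_id, req.idempotency_key)
        # A retried request with the same key returns the stored result
        # instead of running inference a second time.
        if key not in self._results:
            self._results[key] = handler(req)
        return self._results[key]
```

Resolving the alias at admission time and echoing the pinned version in the response is one way to keep retries and canary comparisons reproducible; the trade-off is that clients using aliases see behavior change when the alias moves.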

Assume typical production SLOs (e.g., p95 latency < 300 ms for small models; streaming for LLMs), payloads up to ~10 MB, and that you need to operate multi-region.
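
The 300 ms p95 target above bounds how long a micro-batcher can hold requests while waiting for a batch to fill. The sketch below is a simplified, deterministic model of that policy, not a production scheduler: it treats the SLO as a per-request budget, and the inference time, overhead, and batch size in the usage example are hypothetical numbers.

```python
from typing import List, Tuple


def max_wait_ms(slo_ms: float, inference_ms: float, overhead_ms: float) -> float:
    """Longest time a request may queue before its batch must launch."""
    return max(0.0, slo_ms - inference_ms - overhead_ms)


def form_batches(arrivals_ms: List[float], max_batch: int,
                 wait_ms: float) -> List[Tuple[float, List[float]]]:
    """Group sorted arrival times into (launch_time, members) batches.

    A batch launches when it is full, or when its oldest request has
    waited wait_ms, whichever comes first.
    """
    batches: List[Tuple[float, List[float]]] = []
    current: List[float] = []
    for t in arrivals_ms:
        # The oldest queued request hit its deadline before this arrival.
        if current and t > current[0] + wait_ms:
            batches.append((current[0] + wait_ms, current))
            current = []
        current.append(t)
        # Batch is full: launch immediately at this arrival time.
        if len(current) == max_batch:
            batches.append((t, current))
            current = []
    if current:
        # Leftover requests launch when the oldest one times out.
        batches.append((current[0] + wait_ms, current))
    return batches


# Hypothetical budget: 300 ms SLO, 120 ms inference, 30 ms network/serialization.
wait = max_wait_ms(300, 120, 30)
batches = form_batches([0, 10, 20, 400, 405, 410, 415, 900], max_batch=4, wait_ms=wait)
```

With these inputs the sparse early requests launch as a partial batch at the deadline, while the burst around 400 ms fills a batch and launches immediately. A larger wait window raises GPU utilization at the cost of queueing latency, which is the core batching trade-off the question asks about.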

Provide a concise, well-structured design with diagrams described in words and call out the most important trade-offs.
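
One trade-off worth making concrete is bin-packing models onto the heterogeneous fleet described in the scenario. The sketch below uses a best-fit decreasing heuristic on GPU memory alone, which is a simplification: it ignores compute contention, model affinity, and anti-affinity. The GPU names, model names, model footprints, and the 2 GB headroom are hypothetical; the memory sizes assume a 16 GB T4, a 24 GB A10, and the 80 GB A100 variant.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Gpu:
    name: str
    gpu_type: str
    total_gb: float
    used_gb: float = 0.0
    models: List[str] = field(default_factory=list)

    @property
    def free_gb(self) -> float:
        return self.total_gb - self.used_gb


def place(models: Dict[str, float], gpus: List[Gpu],
          headroom_gb: float = 2.0) -> Dict[str, Optional[str]]:
    """Best-fit decreasing placement by memory footprint.

    Largest models are placed first, each on the GPU with the least free
    memory that still fits the model plus headroom (for activations,
    KV cache, and so on). Models that fit nowhere map to None.
    """
    placement: Dict[str, Optional[str]] = {}
    for model, need_gb in sorted(models.items(), key=lambda kv: kv[1], reverse=True):
        candidates = [g for g in gpus if g.free_gb - headroom_gb >= need_gb]
        if not candidates:
            placement[model] = None   # needs a bigger GPU or model sharding
            continue
        target = min(candidates, key=lambda g: g.free_gb)
        target.models.append(model)
        target.used_gb += need_gb
        placement[model] = target.name
    return placement


fleet = [Gpu("t4-0", "T4", 16), Gpu("a10-0", "A10", 24), Gpu("a100-0", "A100", 80)]
footprints_gb = {"llm": 60, "ranker": 10, "vision": 6, "embed": 12, "huge": 100}
result = place(footprints_gb, fleet)
```

In this example the heuristic packs three models onto the A100 and leaves the A10 empty, which is good for utilization and scale-down but puts the small models next to the LLM. That is the isolation versus utilization tension: tighter packing saves GPUs, while spreading or partitioning (for example with MIG) limits noisy-neighbor effects.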
