Design a cloud AI inference platform

Q: Design a cloud AI inference platform

This question evaluates competency in ML system design and operational engineering, covering model packaging and versioning, hardware selection (CPU vs GPU), autoscaling, request routing, multi-tenant isolation, SLO-driven observability, cost controls, and failure handling within a cloud inference platform.

Q: How do I approach ML System Design interview questions?

ML System Design questions require both a solid grasp of core concepts and regular practice. PracHub provides solutions with explanations to help you master ML system design interviews.

Q: What difficulty level is this interview question?

This is a hard-difficulty ML System Design question, commonly asked during HR Screen rounds at Lambda.

Q: What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Lambda during technical interviews.

Question

System Design: Cloud AI Inference Platform for Real-Time and Batch

Context

Design a multi-tenant cloud platform that serves machine learning models for both real-time (online) and batch (offline) workloads. The platform should support multiple model frameworks and versions, meet latency and throughput targets, and provide strong isolation, observability, and cost control.

Requirements

Model Packaging and Versioning
- How will models (artifacts + code + environment) be packaged, versioned, and promoted across environments?
Hardware Selection
- When should a model or workload run on CPU versus GPU? Consider quantization, compilation, and GPU sharing.
Autoscaling Strategy
- Horizontal and vertical autoscaling for online and batch workloads; scale-to-zero for idle models.
Request Routing
- Routing by model/version/tenant; consistent hashing; priority and rate limiting; token streaming for generative models.
Multi-Tenant Isolation
- Compute/storage/network isolation; quotas; fairness; GPU partitioning.
SLO Targets
- Propose latency and throughput targets for representative model types (e.g., small classifiers, embeddings, LLM text generation) and batch SLAs.
Cost Controls
- Right-sizing; spot/preemptible instances; model optimization (quantization, distillation); bin packing; budgets/chargeback.
Observability
- Tracing, metrics, logs; SLI/SLOs; GPU metrics; model quality/drift.
Safe Rollouts
- Canary and shadow deployments; rollback criteria; guardrails.
Failure Handling
- Retries/backoff; circuit breaking; regional failover; GPU OOM; degraded modes.
Kubernetes Integration
- Ingress, scheduling (GPUs/MIG), operators/CRDs, HPA/KEDA, service mesh, persistent caches.
Model Registry Integration
- Registry choices, signatures/schemas, lineage; CI/CD from registry to serving.
Industry Comparison
- Briefly compare to managed offerings and open-source stacks.
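
To make one of the requirements above concrete, the Request Routing item (routing by model/version/tenant with consistent hashing) could be sketched roughly as follows. This is a minimal illustration, not a reference design: the `HashRing` class, the replica names, the `tenant/model:version` key format, and the choice of 64 virtual nodes per replica are all assumptions made for the example. Priority, rate limiting, and token streaming are out of scope here.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash; Python's built-in hash() is salted per process."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, replicas, vnodes=64):
        self._vnodes = vnodes   # assumed value; tune for balance vs. memory
        self._points = []       # sorted hash positions on the ring
        self._owner = {}        # ring position -> replica name
        for replica in replicas:
            self.add(replica)

    def add(self, replica):
        # Each replica owns several points so load spreads more evenly.
        for i in range(self._vnodes):
            point = _hash(f"{replica}#{i}")
            self._owner[point] = replica
            bisect.insort(self._points, point)

    def remove(self, replica):
        # Only keys that mapped to this replica move to a new owner.
        self._points = [p for p in self._points if self._owner[p] != replica]
        self._owner = {p: r for p, r in self._owner.items() if r != replica}

    def route(self, tenant, model, version):
        if not self._points:
            raise LookupError("no replicas registered")
        key = _hash(f"{tenant}/{model}:{version}")
        # First ring point clockwise from the key, wrapping around at the end.
        idx = bisect.bisect_right(self._points, key) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["gpu-a", "gpu-b", "gpu-c"])
replica = ring.route("tenant-42", "sentiment-classifier", "v3")
```

Keying on tenant, model, and version together keeps a given model version's requests on the same replica, which helps keep loaded weights and caches warm. When a replica is removed, only the keys it owned are remapped, so the remaining replicas keep their existing assignments.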

