System Design: Cloud AI Inference Platform for Real-Time and Batch
Context
Design a multi-tenant cloud platform that serves machine learning models for both real-time (online) and batch (offline) workloads. The platform should support multiple model frameworks and versions, meet latency and throughput targets, and provide strong isolation, observability, and cost control.
Requirements
- Model Packaging and Versioning (sketch below)
  - How will models (artifacts + code + environment) be packaged, versioned, and promoted across environments?
- Hardware Selection (sketch below)
  - When should CPU vs. GPU be used for different models and workloads? Consider quantization, compilation, and GPU sharing.
- Autoscaling Strategy (sketch below)
  - Horizontal/vertical autoscaling for online and batch; scale-to-zero for idle models.
- Request Routing (sketch below)
  - Routing by model/version/tenant; consistent hashing; priority and rate limiting; token streaming for generative models.
- Multi-Tenant Isolation (sketch below)
  - Compute/storage/network isolation; quotas; fairness; GPU partitioning.
- SLO Targets (sketch below)
  - Propose latency and throughput targets for representative model types (e.g., small classifiers, embeddings, LLM text generation) and batch SLAs.
- Cost Controls (sketch below)
  - Right-sizing; spot/preemptible instances; model optimization (quantization, distillation); bin packing; budgets/chargeback.
- Observability (sketch below)
  - Tracing, metrics, and logs; SLIs/SLOs; GPU metrics; model quality and drift.
- Safe Rollouts (sketch below)
  - Canary and shadow deployments; rollback criteria; guardrails.
- Failure Handling (sketch below)
  - Retries/backoff; circuit breaking; regional failover; GPU out-of-memory handling; degraded modes.
- Kubernetes Integration
  - Ingress, scheduling (GPUs/MIG), operators/CRDs, HPA/KEDA, service mesh, persistent caches.
- Model Registry Integration
  - Registry choices, signatures/schemas, lineage; CI/CD from registry to serving.
- Industry Comparison
  - Briefly compare to managed offerings and open-source stacks.
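
Illustrative Sketches
The sketches below make several of the requirements above concrete. Each is a minimal example under stated assumptions; names, thresholds, and prices are placeholders, not recommendations.

For Model Packaging and Versioning, one option is a per-version manifest that pins the model artifact, the serving image, and the I/O schemas together, so promotion across environments moves an immutable, verifiable unit. The field names below are illustrative assumptions, not a standard format.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelManifest:
    """Hypothetical manifest pinning everything needed to serve one model version."""
    name: str                # logical model name, e.g. "fraud-classifier"
    version: str             # immutable version (semantic or content-addressed)
    framework: str           # e.g. "onnx", "torch", "tensorrt"
    artifact_uri: str        # object-store location of the weights/artifact
    artifact_sha256: str     # integrity check so promotion verifies bytes, not tags
    image_digest: str        # serving container pinned by digest, not by mutable tag
    input_schema: dict = field(default_factory=dict)   # request schema for validation
    output_schema: dict = field(default_factory=dict)  # response schema for validation
    stage: str = "staging"   # staging -> canary -> production promotion path

manifest = ModelManifest(
    name="fraud-classifier",
    version="2024-06-01-3f9ab2",
    framework="onnx",
    artifact_uri="s3://models/fraud-classifier/2024-06-01-3f9ab2/model.onnx",
    artifact_sha256="c0ffee0placeholder",
    image_digest="sha256:abc123placeholder",
)
```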
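
For Hardware Selection, a first-order check is cost per million requests at each target's measured throughput; GPUs only win when batching keeps them busy. The prices and throughputs below are placeholder assumptions.

```python
def cost_per_million(requests_per_second: float, hourly_price_usd: float) -> float:
    """Cost to serve one million requests on one instance at steady throughput."""
    seconds_needed = 1_000_000 / requests_per_second
    return hourly_price_usd * seconds_needed / 3600.0

# Placeholder numbers: a small quantized model on CPU vs. the same model batched on a GPU.
cpu_cost = cost_per_million(requests_per_second=200, hourly_price_usd=0.40)    # ~$0.56/M
gpu_cost = cost_per_million(requests_per_second=4000, hourly_price_usd=2.50)   # ~$0.17/M
# The GPU number assumes sustained high utilization; at low or bursty traffic,
# CPU serving (or scale-to-zero) is usually cheaper.
```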
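
For the Autoscaling Strategy, online serving commonly scales on a load metric with the proportional rule used by HPA/KEDA-style autoscalers (desired = ceil(current metric / per-replica target)), plus scale-to-zero for idle models. The metric choice and thresholds below are assumptions.

```python
import math

def desired_replicas(in_flight_requests: int,
                     target_per_replica: int,
                     current_replicas: int,
                     idle_seconds: float,
                     scale_to_zero_after_s: float = 300.0,
                     max_replicas: int = 50) -> int:
    """Proportional scaling on an in-flight-request metric (illustrative)."""
    if in_flight_requests == 0 and idle_seconds >= scale_to_zero_after_s:
        return 0                               # scale-to-zero for idle models
    desired = max(1, math.ceil(in_flight_requests / target_per_replica))
    if desired < current_replicas:             # dampen scale-down to avoid thrashing
        desired = max(desired, current_replicas // 2)
    return min(desired, max_replicas)

replicas = desired_replicas(in_flight_requests=120, target_per_replica=8,
                            current_replicas=10, idle_seconds=0.0)   # -> 15
```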
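
For Request Routing, a consistent-hash ring keeps a given model/version (or tenant) pinned to the same replicas as the fleet changes, preserving warm weights and caches. This is a minimal ring with virtual nodes, not a production router.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative)."""
    def __init__(self, nodes: list[str], vnodes: int = 128):
        self._ring: list[tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def route(self, key: str) -> str:
        """Pick the first node clockwise from the key's position on the ring."""
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["replica-a", "replica-b", "replica-c"])
backend = ring.route("tenant-42/llm-chat/v3")   # stable while the replica set is stable
```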
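
For Multi-Tenant Isolation, per-tenant quotas are commonly enforced with a token bucket at admission time, before requests reach shared GPU queues. The tiers and rates below are placeholders.

```python
import time

class TokenBucket:
    """Per-tenant admission control: refill at `rate` tokens/s up to `burst` (illustrative)."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False    # reject or queue at lower priority rather than starving other tenants

buckets = {"tenant-a": TokenBucket(rate=100, burst=200),   # paid tier (placeholder numbers)
           "tenant-b": TokenBucket(rate=10, burst=20)}     # free tier
admitted = buckets["tenant-b"].allow()
```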
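
For SLO Targets, whatever numbers are chosen should translate into an explicit error budget and a percentile check against live measurements. The example targets here (p99 latency per model class, 99.9% availability) are assumptions, not recommendations.

```python
def error_budget_minutes(availability_slo: float, window_days: int = 30) -> float:
    """Allowed downtime within the window for a given availability SLO."""
    return (1.0 - availability_slo) * window_days * 24 * 60

def p99_ms(latencies_ms: list[float]) -> float:
    """Nearest-rank p99 over a sample of request latencies."""
    ordered = sorted(latencies_ms)
    return ordered[max(0, round(0.99 * len(ordered)) - 1)]

# Placeholder targets: small classifier p99 <= 50 ms, embeddings p99 <= 100 ms,
# LLM generation time-to-first-token p99 <= 1000 ms; availability 99.9%.
budget = error_budget_minutes(0.999)                      # ~43.2 minutes per 30-day window
meets_slo = p99_ms([12.0, 18.0, 24.0, 47.0]) <= 50.0      # True for this tiny sample
```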
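
For Cost Controls, the bin-packing part can be approximated with first-fit decreasing over per-model memory footprints, so several small models share one accelerator. The GPU size and footprints below are placeholders, and headroom for activations/KV cache is ignored.

```python
def first_fit_decreasing(model_mem_gb: dict[str, float], gpu_mem_gb: float) -> list[list[str]]:
    """Pack models onto as few GPUs as possible by memory footprint (illustrative heuristic)."""
    gpus: list[tuple[float, list[str]]] = []          # (free memory, models placed)
    for name, mem in sorted(model_mem_gb.items(), key=lambda kv: -kv[1]):
        for i, (free, placed) in enumerate(gpus):
            if mem <= free:
                gpus[i] = (free - mem, placed + [name])
                break
        else:
            gpus.append((gpu_mem_gb - mem, [name]))   # open a new GPU
    return [placed for _, placed in gpus]

packing = first_fit_decreasing(
    {"llm-7b-int4": 6.0, "embedder": 2.0, "reranker": 4.0, "classifier": 1.0},
    gpu_mem_gb=24.0,
)   # -> [["llm-7b-int4", "reranker", "embedder", "classifier"]]
```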
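
For Observability, the starting point is per-model, per-version request and latency metrics alongside GPU and quality metrics. The snippet below uses the prometheus_client library; the metric names and labels are assumptions.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Inference requests",
                   ["model", "version", "tenant", "status"])
LATENCY = Histogram("inference_latency_seconds", "End-to-end inference latency",
                    ["model", "version"],
                    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0))

def handle(model: str, version: str, tenant: str, predict):
    """Wrap a prediction call with request counting and latency observation."""
    start = time.perf_counter()
    try:
        result = predict()
        REQUESTS.labels(model, version, tenant, "ok").inc()
        return result
    except Exception:
        REQUESTS.labels(model, version, tenant, "error").inc()
        raise
    finally:
        LATENCY.labels(model, version).observe(time.perf_counter() - start)

start_http_server(9090)   # expose /metrics for scraping
```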
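
For Safe Rollouts, rollback criteria should be numeric and automated. This gate compares a canary against the baseline on error rate and p99 latency; the thresholds are placeholders.

```python
def canary_verdict(baseline: dict, canary: dict,
                   max_error_rate_delta: float = 0.005,
                   max_p99_ratio: float = 1.2,
                   min_requests: int = 1000) -> str:
    """Return 'promote', 'rollback', or 'wait' for a canary release (illustrative gate)."""
    if canary["requests"] < min_requests:
        return "wait"                                   # not enough traffic to decide
    if canary["error_rate"] > baseline["error_rate"] + max_error_rate_delta:
        return "rollback"
    if canary["p99_ms"] > baseline["p99_ms"] * max_p99_ratio:
        return "rollback"
    return "promote"

verdict = canary_verdict(
    baseline={"requests": 50_000, "error_rate": 0.002, "p99_ms": 180.0},
    canary={"requests": 2_500, "error_rate": 0.003, "p99_ms": 190.0},
)   # -> "promote"
```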
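
For Failure Handling, the client-facing half is typically capped, jittered exponential backoff combined with a circuit breaker, so a struggling backend is not hammered while it recovers or fails over. All constants below are placeholders, and retries assume idempotent requests.

```python
import random
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe after `cooldown_s` (illustrative)."""
    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        return (self.failures < self.threshold
                or time.monotonic() - self.opened_at >= self.cooldown_s)

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

def call_with_retries(send, breaker: CircuitBreaker, attempts: int = 4, base_delay_s: float = 0.05):
    """Capped, jittered exponential backoff around an idempotent request."""
    for attempt in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open: shed load or fail over to another region")
        try:
            result = send()
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            if attempt == attempts - 1:
                raise
            time.sleep(min(2.0, base_delay_s * 2 ** attempt) * random.uniform(0.5, 1.5))
```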