System Design: Low‑Latency ML Inference API (Real‑Time)
Context
You are designing an in‑region, synchronous inference API used by product surfaces (e.g., ranking, fraud checks, personalization) that require tight latency and high availability. The service must support multiple tenants, safe model rollouts, and strong observability, while controlling cost.
Requirements
- Target SLOs
  - Propose p50/p95 (and optionally p99) end‑to‑end latency targets and availability targets.
  - Define SLIs and error budgets (see the worked example below).
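As a point of reference, here is the error-budget arithmetic for one possible target: 99.9% availability over a 30-day window, plus a hypothetical latency SLO of 99.9% of requests under 150 ms. The numbers are illustrative, not prescriptive.

```python
# Illustrative error-budget arithmetic. All targets are example values.

SLO_AVAILABILITY = 0.999       # example availability target
WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

error_budget_minutes = (1 - SLO_AVAILABILITY) * WINDOW_MINUTES
print(f"Error budget: {error_budget_minutes:.1f} min per 30 days")  # 43.2

# Latency SLI as a good-event ratio: fraction of requests under the
# hypothetical 150 ms p-target, against a 99.9% latency SLO.
requests_total = 10_000_000
requests_under_150ms = 9_996_000  # hypothetical measurement
sli = requests_under_150ms / requests_total
budget_consumed = (1 - sli) / (1 - SLO_AVAILABILITY)
print(f"SLI: {sli:.4%}, latency budget consumed: {budget_consumed:.0%}")  # 40%
```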
- API
  - Define the request/response schema, including idempotency, model/version selection, and metadata for traceability (see the schema sketch below).
  - Authentication and authorization approach.
  - Rate limiting and quotas (see the token-bucket sketch below).
  - Multitenancy (tenant isolation, quotas, and model routing).
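One plausible shape for the request/response contract, sketched as Python dataclasses. The field names (`tenant_id`, `idempotency_key`, `model_version`, `trace_id`) are assumptions for illustration, not a fixed contract.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class InferenceRequest:
    tenant_id: str                       # routes quotas and model selection
    idempotency_key: str                 # dedupes client retries safely
    model: str                           # logical model name, e.g. "fraud-score"
    model_version: Optional[str] = None  # pin a version; None = registry default
    features: dict = field(default_factory=dict)  # inline features, if any
    trace_id: Optional[str] = None       # propagated for end-to-end tracing

@dataclass(frozen=True)
class InferenceResponse:
    prediction: float
    model_version: str                   # version actually served (for audit)
    request_id: str                      # server-assigned, logged at every stage
    served_from_cache: bool = False
    latency_ms: float = 0.0
```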
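For rate limiting, a per-tenant token bucket is one common mechanism: tokens refill at the sustained rate, and the bucket depth sets the allowed burst. A minimal in-process sketch; a real multi-instance deployment would typically back this with a shared store such as Redis.

```python
import time

class TokenBucket:
    """Per-tenant token bucket: refill at `rate` tokens/sec up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst depth.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Hypothetical quota: 100 req/s sustained, bursts of 200, per tenant.
buckets: dict[str, TokenBucket] = {}

def check(tenant_id: str) -> bool:
    bucket = buckets.setdefault(tenant_id, TokenBucket(rate=100, burst=200))
    return bucket.allow()
```

The numbers are placeholders that a real design would derive from per-tenant quotas.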
- Architecture
  - Load balancing and edge protections.
  - Stateless API tier design.
  - Feature retrieval (online store), consistency, and TTLs (see the read-through cache sketch below).
  - Model serving choices (CPU/GPU), dynamic batching, quantization, caching (see the batching sketch below).
  - Autoscaling strategies for the API tier, feature store, and model servers.
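For feature retrieval, a read-through cache with a short TTL bounds staleness while cutting online-store load. A minimal sketch, where `fetch` stands in for whatever client the online store exposes:

```python
import time
from typing import Callable

class TTLFeatureCache:
    """Read-through cache: serve cached features while fresher than ttl_s."""

    def __init__(self, fetch: Callable[[str], dict], ttl_s: float = 5.0):
        self._fetch = fetch  # e.g., a lookup against the online feature store
        self._ttl_s = ttl_s
        self._entries: dict[str, tuple[float, dict]] = {}  # eviction omitted

    def get(self, entity_id: str) -> dict:
        hit = self._entries.get(entity_id)
        if hit and time.monotonic() - hit[0] < self._ttl_s:
            return hit[1]  # fresh enough: skip the store round-trip
        features = self._fetch(entity_id)  # cache miss or stale entry
        self._entries[entity_id] = (time.monotonic(), features)
        return features

# Hypothetical usage: a 5-second staleness bound on user features.
cache = TTLFeatureCache(fetch=lambda uid: {"txn_count_1h": 3}, ttl_s=5.0)
```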
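Dynamic batching trades a small queueing delay for model-server throughput: collect requests until either a maximum batch size or a maximum wait is reached, then run one forward pass. A minimal asyncio sketch, where `run_model` is a stand-in for the real batched model call and the 5 ms window is illustrative:

```python
import asyncio

MAX_BATCH = 32
MAX_WAIT_S = 0.005  # 5 ms batching window (illustrative); tune against the SLO

queue: asyncio.Queue = asyncio.Queue()

async def run_model(inputs: list) -> list:
    # Stand-in for the real batched (GPU) forward pass.
    return [f"pred:{x}" for x in inputs]

async def predict(x):
    """Called per request: enqueue the input and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def batcher():
    """Background task: drain the queue into batches and run them."""
    while True:
        items = [await queue.get()]  # block until at least one request arrives
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        while len(items) < MAX_BATCH:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        outputs = await run_model([x for x, _ in items])  # one pass per batch
        for (_, fut), y in zip(items, outputs):
            fut.set_result(y)
```

`batcher()` runs as a background task alongside the request handlers; the wait window should be sized so that queueing delay plus inference still fits inside the latency SLO.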
- Release Safety and Experimentation
  - Model versioning and registry.
  - Canary and shadow deployments, with rollback criteria.
  - Online A/B testing (assignment, metrics, guardrails; see the bucketing sketch below).
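Canary routing and A/B assignment can share one primitive: deterministic hashing of a stable unit (user or tenant) with a per-experiment salt, so assignment is sticky across requests and reproducible offline. A sketch with an illustrative salt and a 5% canary split:

```python
import hashlib

def bucket(unit_id: str, salt: str, buckets: int = 10_000) -> int:
    """Deterministic bucket in [0, buckets): stable across requests and hosts."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def assign_variant(user_id: str) -> str:
    # Hypothetical rollout: 5% of units to the canary model, rest to control.
    return "canary" if bucket(user_id, salt="model-v7-canary") < 500 else "control"
```

Changing the salt per experiment re-randomizes assignment, which avoids carry-over bias between consecutive experiments on the same population.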
- Observability and Quality
  - Metrics, logs, and tracing (end‑to‑end and per stage).
  - Data/feature quality checks and drift detection (see the PSI sketch below).
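For drift detection, the Population Stability Index (PSI) is one common first-line check: compare the live distribution of a feature against a reference window over quantile bins. A minimal NumPy sketch; the conventional reading (below 0.1 stable, above 0.2 investigate) is a rule of thumb, not a universal threshold.

```python
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over quantile bins of the reference data."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip both samples into the reference range so tails land in edge bins.
    ref = np.clip(reference, edges[0], edges[-1])
    liv = np.clip(live, edges[0], edges[-1])
    p_ref = np.histogram(ref, edges)[0] / len(ref)
    p_live = np.histogram(liv, edges)[0] / len(liv)
    p_ref = np.maximum(p_ref, 1e-6)   # avoid division by zero / log(0)
    p_live = np.maximum(p_live, 1e-6)
    return float(np.sum((p_live - p_ref) * np.log(p_live / p_ref)))

# Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 investigate.
```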
- Cost and Reliability
  - Cost controls (utilization targets, right‑sizing, caching, tiering).
  - Fallback behavior under partial outages or capacity shortfalls (see the circuit-breaker sketch below).
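For fallback behavior, a circuit breaker in front of the model server lets the API degrade to a cheap default (cached score, heuristic, or a smaller model) rather than queue into an outage. A minimal sketch; `call_model_server` and `fallback_score` are hypothetical stand-ins, and the thresholds are illustrative.

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; probe after `reset_s`."""

    def __init__(self, max_failures: int = 5, reset_s: float = 10.0):
        self.max_failures, self.reset_s = max_failures, reset_s
        self.failures = 0
        self.opened_at = 0.0

    def available(self) -> bool:
        if self.failures < self.max_failures:
            return True  # closed: requests flow normally
        return time.monotonic() - self.opened_at > self.reset_s  # half-open probe

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0  # success closes the breaker
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def call_model_server(request) -> float:
    raise RuntimeError("upstream unavailable")  # stand-in for the real RPC

def fallback_score(request) -> float:
    return 0.0  # stand-in cheap default (cached score, heuristic, small model)

breaker = CircuitBreaker()

def score(request) -> float:
    if breaker.available():
        try:
            result = call_model_server(request)
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
    return fallback_score(request)  # degrade instead of queueing into an outage
```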
- Security and Compliance
  - Request security, mTLS, and secrets management.
  - PII handling, retention, and auditability (see the redaction sketch below).
  - Regionalization/data sovereignty and a disaster recovery plan.
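On PII handling, one concrete control worth specifying is redaction at the logging boundary, so raw identifiers never reach log storage. A sketch assuming a field denylist; the field names and salting scheme are illustrative, and in practice the denylist would come from a data classification policy.

```python
import hashlib

# Assumed denylist of PII fields (illustrative).
PII_FIELDS = {"email", "phone", "ssn", "name", "ip_address"}

def redact(record: dict, salt: str = "log-pepper") -> dict:
    """Replace PII values with salted hashes: logs stay joinable, not readable."""
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256(f"{salt}:{value}".encode()).hexdigest()
            out[key] = f"redacted:{digest[:12]}"
        else:
            out[key] = value
    return out

print(redact({"tenant_id": "t-42", "email": "a@example.com", "score": 0.97}))
```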
Deliverables
- A concrete proposal covering the above, including clearly stated numerical targets and trade‑offs.
- Any assumptions you make that influence the design.