System Design: ML Inference Orchestration Platform
Context
You are designing a multi-tenant platform that exposes several ML models as independent services (for example, text classification, embedding generation, and re-ranking). External clients should be able to invoke end-to-end workflows that chain these models. The platform must support both low-latency synchronous requests and higher-latency asynchronous jobs.
Assume:
- External clients send requests with an input document (text or metadata), a chosen workflow, and optional parameters (model versions, thresholds, etc.).
- Individual model services are independently deployed and versioned.
- Workflows may include branching and parallel steps (e.g., compute embeddings while running classification, then re-rank the results).
Task
Design the orchestration platform and address the following:
- Request Routing and Orchestration
  - How do external requests arrive and get routed to workflow execution?
  - How are workflows represented (DAG or state machine) and executed? (A minimal DAG sketch appears under Illustrative Sketches below.)
  - Synchronous vs. asynchronous execution.
- Data Flow Between Models
  - How are inputs and outputs passed between steps (in-memory vs. references)?
  - Handling large payloads and parallel branches.
- Schemas and Validation
  - Define request/response schemas and a validation strategy across services. (See the schema sketch below.)
  - Versioning and backward compatibility.
- Intermediate Storage and Caching
  - Where are intermediate artifacts stored, and how are reusable results cached (e.g., embeddings keyed by content hash)? (See the caching sketch below.)
  - TTLs and invalidation.
- Final Results Storage
  - How to persist final outputs for retrieval, analytics, and audit.
  - Retention and multi-tenant partitioning.
- Model and Version Management
  - Model registry, version pinning, canary/A-B testing, and rollback. (See the canary-routing sketch below.)
- Failure Handling and Retries
  - Timeouts, retries with backoff and jitter, idempotency, partial failures, and fallbacks. (See the retry sketch below.)
- Monitoring and Traceability
  - Metrics, logs, and distributed tracing across the workflow.
  - Per-tenant visibility and cost/QPS attribution.
- Scaling and Deployment
  - Autoscaling model services (CPU/GPU), concurrency controls, and warm pools.
  - Multi-region operation, high availability, and deployment strategies.
- Multi-tenant Auth and Rate Limiting
  - Authentication/authorization, quotas, rate limiting, and isolation. (See the rate-limiting sketch below.)
- RPC vs. REST Between Services
  - Justify when to use RPC versus REST for internal calls.
  - Discuss trade-offs in latency, schema evolution, observability, and backward compatibility.
Be explicit about assumptions and provide rationales for key choices.
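Illustrative Sketches
The sketches below are non-normative. Each one illustrates a single topic from the task list under explicitly stated assumptions; none of them is a prescribed implementation.

First, one way a workflow could be represented as a DAG, as referenced under Request Routing and Orchestration. The step names ("embed", "classify", "rerank"), service names, versions, and field layout are assumptions made for this sketch; only the dependency structure matters.

```python
from dataclasses import dataclass, field

# Illustrative workflow definition: a DAG of model-service steps.
# Step, service, and version names are assumptions for the sketch.

@dataclass
class Step:
    name: str
    service: str                    # logical model service to call
    version: str                    # pinned model version
    depends_on: list[str] = field(default_factory=list)

@dataclass
class Workflow:
    name: str
    steps: list[Step]

    def execution_levels(self) -> list[list[str]]:
        """Group steps into levels; steps in the same level can run in parallel."""
        remaining = {s.name: set(s.depends_on) for s in self.steps}
        done: set[str] = set()
        levels: list[list[str]] = []
        while remaining:
            ready = [name for name, deps in remaining.items() if deps <= done]
            if not ready:
                raise ValueError("cycle detected in workflow DAG")
            levels.append(ready)
            done.update(ready)
            for name in ready:
                del remaining[name]
        return levels

# Example: embeddings and classification run in parallel, then re-ranking.
wf = Workflow("classify_and_rerank", [
    Step("embed", service="embeddings", version="v3"),
    Step("classify", service="text-classifier", version="v7"),
    Step("rerank", service="reranker", version="v2", depends_on=["embed", "classify"]),
])
print(wf.execution_levels())  # [['embed', 'classify'], ['rerank']]
```

Steps in the same level have no mutual dependencies and can be dispatched concurrently; an asynchronous runner could checkpoint per-level state so a failed run resumes rather than restarts.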
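Next, a minimal sketch of request/response shapes and validation, using plain dataclasses; the same shapes could equally be expressed as JSON Schema, Protobuf, or Pydantic models. Field names such as tenant_id and the schema_version convention are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any

# Illustrative request/response shapes for the external API.
# Field names and the schema_version convention are assumptions.

@dataclass
class InferenceRequest:
    tenant_id: str
    workflow: str                                           # e.g. "classify_and_rerank"
    document: str                                           # input text, or a URI reference to it
    params: dict[str, Any] = field(default_factory=dict)    # model versions, thresholds, ...
    schema_version: str = "1.0"

@dataclass
class InferenceResponse:
    request_id: str
    status: str                                             # "succeeded" | "failed" | "pending"
    outputs: dict[str, Any] = field(default_factory=dict)   # keyed by step name
    schema_version: str = "1.0"

def validate_request(req: InferenceRequest, known_workflows: set[str]) -> list[str]:
    """Return a list of validation errors; an empty list means the request is acceptable."""
    errors: list[str] = []
    if not req.tenant_id:
        errors.append("tenant_id is required")
    if req.workflow not in known_workflows:
        errors.append(f"unknown workflow: {req.workflow!r}")
    if not req.document:
        errors.append("document must be non-empty")
    if not req.schema_version.startswith("1."):
        errors.append(f"unsupported schema_version: {req.schema_version}")
    return errors
```

One common convention, assumed here, is that additive changes keep the same major schema_version and consumers ignore unknown fields, while a major bump signals a breaking change.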
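A minimal sketch of content-addressed caching for reusable intermediate results such as embeddings, using an in-process dictionary as a stand-in for a shared cache like Redis. The key format and the one-week TTL are assumptions; including the model version in the key means a model upgrade naturally invalidates stale entries.

```python
import hashlib
import time

EMBEDDING_TTL_SECONDS = 7 * 24 * 3600   # assumed retention for cached embeddings

def cache_key(model: str, version: str, content: str) -> str:
    """Content-addressed key: same text + same model version -> same key."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"emb:{model}:{version}:{digest}"

class TtlCache:
    """Minimal in-process stand-in for a shared cache such as Redis."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() > expires_at:
            del self._store[key]         # lazy expiry on read
            return None
        return value

    def set(self, key: str, value, ttl: float) -> None:
        self._store[key] = (time.time() + ttl, value)

def get_or_compute_embedding(cache: TtlCache, model: str, version: str, text: str, compute):
    """Return a cached embedding if present; otherwise compute, cache, and return it."""
    key = cache_key(model, version, text)
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = compute(text)                       # call to the embeddings service
    cache.set(key, value, EMBEDDING_TTL_SECONDS)
    return value
```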
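For Model and Version Management, a sketch of a routing table that honors client version pins and sends a small fraction of traffic to a canary. Service names, version strings, and the 5% canary fraction are assumptions.

```python
import random

# Illustrative version-pinning / canary routing table; all values are assumptions.
ROUTING = {
    "text-classifier": {"stable": "v7", "canary": "v8", "canary_fraction": 0.05},
    "embeddings":      {"stable": "v3", "canary": None, "canary_fraction": 0.0},
}

def pick_version(service: str, pinned: str | None = None) -> str:
    """Honor an explicit client pin; otherwise route a small fraction of traffic to the canary."""
    entry = ROUTING[service]
    if pinned is not None:
        return pinned
    if entry["canary"] and random.random() < entry["canary_fraction"]:
        return entry["canary"]
    return entry["stable"]
```

With this shape, rollback reduces to setting the canary fraction back to zero (or swapping the stable version) without touching workflow definitions.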
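A minimal sketch of retries with exponential backoff and full jitter, reusing one idempotency key across attempts so a downstream service can deduplicate work. The attempt count, delay constants, and the RetryableError classification are assumptions.

```python
import random
import time
import uuid

MAX_ATTEMPTS = 4      # assumed retry budget
BASE_DELAY = 0.2      # seconds
MAX_DELAY = 5.0       # seconds

class RetryableError(Exception):
    """Raised by the wrapped call for errors that are safe to retry (timeouts, 503s, ...)."""

def call_with_retries(call, *, idempotency_key: str | None = None):
    """Retry `call` with exponential backoff and full jitter.

    The same idempotency key is sent on every attempt so that a retry of a
    request that actually succeeded does not produce duplicate work downstream.
    """
    key = idempotency_key or str(uuid.uuid4())
    last_error: Exception | None = None
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call(idempotency_key=key)
        except RetryableError as exc:
            last_error = exc
            if attempt == MAX_ATTEMPTS - 1:
                break
            delay = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
            time.sleep(random.uniform(0.0, delay))   # full jitter
    raise last_error
```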
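Finally, a sketch of per-tenant token-bucket rate limiting. The per-tenant rate and burst numbers are assumptions, and in a real deployment the buckets would live in shared storage (e.g., Redis) rather than process memory so that all gateway replicas observe the same counts.

```python
import time

class TokenBucket:
    """Classic token bucket: `rate` tokens added per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity            # start full so initial bursts are allowed
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per tenant; the 50 req/s rate and burst of 100 are assumed quotas.
buckets: dict[str, TokenBucket] = {}

def check_rate_limit(tenant_id: str) -> bool:
    bucket = buckets.setdefault(tenant_id, TokenBucket(rate=50.0, capacity=100.0))
    return bucket.allow()
```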