System Design: ML Inference Orchestration Platform
Context
You are designing a multi-tenant platform that exposes several ML models as independent services (for example, text classification, embedding generation, and re-ranking). External clients should be able to invoke end-to-end workflows that chain these models. The platform must support both low-latency synchronous requests and higher-latency asynchronous jobs.
Assume:
- External clients send requests with an input document (text or metadata), a chosen workflow, and optional parameters (model versions, thresholds, etc.).
- Individual model services are independently deployed and versioned.
- Workflows may include branching and parallel steps (e.g., compute embeddings while running classification, then re-rank the results).
Task
Design the orchestration platform and address the following:
- Request Routing and Orchestration
  - How do external requests arrive and get routed to workflow execution?
  - How are workflows represented (DAG or state machine) and executed? (A minimal DAG sketch appears under Illustrative Sketches below.)
  - Synchronous vs. asynchronous execution.
- Data Flow Between Models
  - How are inputs and outputs passed between steps (in-memory vs. references)?
  - Handling large payloads and parallel branches.
- Schemas and Validation
  - Define request/response schemas and a validation strategy across services. (See the schema sketch below.)
  - Versioning and backward compatibility.
- Intermediate Storage and Caching
  - Where are intermediate artifacts stored, and how are reusable results cached (e.g., embeddings keyed by content hash)? (See the caching sketch below.)
  - TTLs and invalidation.
- Final Results Storage
  - How to persist final outputs for retrieval, analytics, and audit.
  - Retention and multi-tenant partitioning.
- Model and Version Management
  - Model registry, version pinning, canary/A-B testing, and rollback. (See the canary-routing sketch below.)
- Failure Handling and Retries
  - Timeouts, retries with backoff and jitter, idempotency, partial failures, and fallbacks. (See the retry sketch below.)
- Monitoring and Traceability
  - Metrics, logs, and distributed tracing across the workflow.
  - Per-tenant visibility and cost/QPS attribution.
- Scaling and Deployment
  - Autoscaling model services (CPU/GPU), concurrency controls, and warm pools.
  - Multi-region operation, high availability, and deployment strategies.
- Multi-tenant Auth and Rate Limiting
  - Authentication/authorization, quotas, rate limiting, and isolation. (See the rate-limiting sketch below.)
- RPC vs. REST Between Services
  - Justify when to use RPC versus REST for internal calls.
  - Discuss trade-offs in latency, schema evolution, observability, and backward compatibility.
Be explicit about assumptions and provide rationales for key choices.
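Illustrative Sketches
The sketches below are non-normative. Each one illustrates a single topic from the task list under explicitly stated assumptions; none of them is a prescribed implementation.

First, one way a workflow could be represented as a DAG, as referenced under Request Routing and Orchestration. The step names ("embed", "classify", "rerank"), service names, versions, and field layout are assumptions made for this sketch; only the dependency structure matters.

```python
from dataclasses import dataclass, field

# Illustrative workflow definition: a DAG of model-service steps.
# Step, service, and version names are assumptions for the sketch.

@dataclass
class Step:
    name: str
    service: str                    # logical model service to call
    version: str                    # pinned model version
    depends_on: list[str] = field(default_factory=list)

@dataclass
class Workflow:
    name: str
    steps: list[Step]

    def execution_levels(self) -> list[list[str]]:
        """Group steps into levels; steps in the same level can run in parallel."""
        remaining = {s.name: set(s.depends_on) for s in self.steps}
        done: set[str] = set()
        levels: list[list[str]] = []
        while remaining:
            ready = [name for name, deps in remaining.items() if deps <= done]
            if not ready:
                raise ValueError("cycle detected in workflow DAG")
            levels.append(ready)
            done.update(ready)
            for name in ready:
                del remaining[name]
        return levels

# Example: embeddings and classification run in parallel, then re-ranking.
wf = Workflow("classify_and_rerank", [
    Step("embed", service="embeddings", version="v3"),
    Step("classify", service="text-classifier", version="v7"),
    Step("rerank", service="reranker", version="v2", depends_on=["embed", "classify"]),
])
print(wf.execution_levels())  # [['embed', 'classify'], ['rerank']]
```

Steps in the same level have no mutual dependencies and can be dispatched concurrently; an asynchronous runner could checkpoint per-level state so a failed run resumes rather than restarts.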
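Next, a minimal sketch of request/response shapes and validation, using plain dataclasses; the same shapes could equally be expressed as JSON Schema, Protobuf, or Pydantic models. Field names such as tenant_id and the schema_version convention are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any

# Illustrative request/response shapes for the external API.
# Field names and the schema_version convention are assumptions.

@dataclass
class InferenceRequest:
    tenant_id: str
    workflow: str                                           # e.g. "classify_and_rerank"
    document: str                                           # input text, or a URI reference to it
    params: dict[str, Any] = field(default_factory=dict)    # model versions, thresholds, ...
    schema_version: str = "1.0"

@dataclass
class InferenceResponse:
    request_id: str
    status: str                                             # "succeeded" | "failed" | "pending"
    outputs: dict[str, Any] = field(default_factory=dict)   # keyed by step name
    schema_version: str = "1.0"

def validate_request(req: InferenceRequest, known_workflows: set[str]) -> list[str]:
    """Return a list of validation errors; an empty list means the request is acceptable."""
    errors: list[str] = []
    if not req.tenant_id:
        errors.append("tenant_id is required")
    if req.workflow not in known_workflows:
        errors.append(f"unknown workflow: {req.workflow!r}")
    if not req.document:
        errors.append("document must be non-empty")
    if not req.schema_version.startswith("1."):
        errors.append(f"unsupported schema_version: {req.schema_version}")
    return errors
```

One common convention, assumed here, is that additive changes keep the same major schema_version and consumers ignore unknown fields, while a major bump signals a breaking change.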
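A minimal sketch of content-addressed caching for reusable intermediate results such as embeddings, using an in-process dictionary as a stand-in for a shared cache like Redis. The key format and the one-week TTL are assumptions; including the model version in the key means a model upgrade naturally invalidates stale entries.

```python
import hashlib
import time

EMBEDDING_TTL_SECONDS = 7 * 24 * 3600   # assumed retention for cached embeddings

def cache_key(model: str, version: str, content: str) -> str:
    """Content-addressed key: same text + same model version -> same key."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"emb:{model}:{version}:{digest}"

class TtlCache:
    """Minimal in-process stand-in for a shared cache such as Redis."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() > expires_at:
            del self._store[key]         # lazy expiry on read
            return None
        return value

    def set(self, key: str, value, ttl: float) -> None:
        self._store[key] = (time.time() + ttl, value)

def get_or_compute_embedding(cache: TtlCache, model: str, version: str, text: str, compute):
    """Return a cached embedding if present; otherwise compute, cache, and return it."""
    key = cache_key(model, version, text)
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = compute(text)                       # call to the embeddings service
    cache.set(key, value, EMBEDDING_TTL_SECONDS)
    return value
```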
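For Model and Version Management, a sketch of a routing table that honors client version pins and sends a small fraction of traffic to a canary. Service names, version strings, and the 5% canary fraction are assumptions.

```python
import random

# Illustrative version-pinning / canary routing table; all values are assumptions.
ROUTING = {
    "text-classifier": {"stable": "v7", "canary": "v8", "canary_fraction": 0.05},
    "embeddings":      {"stable": "v3", "canary": None, "canary_fraction": 0.0},
}

def pick_version(service: str, pinned: str | None = None) -> str:
    """Honor an explicit client pin; otherwise route a small fraction of traffic to the canary."""
    entry = ROUTING[service]
    if pinned is not None:
        return pinned
    if entry["canary"] and random.random() < entry["canary_fraction"]:
        return entry["canary"]
    return entry["stable"]
```

With this shape, rollback reduces to setting the canary fraction back to zero (or swapping the stable version) without touching workflow definitions.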
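A minimal sketch of retries with exponential backoff and full jitter, reusing one idempotency key across attempts so a downstream service can deduplicate work. The attempt count, delay constants, and the RetryableError classification are assumptions.

```python
import random
import time
import uuid

MAX_ATTEMPTS = 4      # assumed retry budget
BASE_DELAY = 0.2      # seconds
MAX_DELAY = 5.0       # seconds

class RetryableError(Exception):
    """Raised by the wrapped call for errors that are safe to retry (timeouts, 503s, ...)."""

def call_with_retries(call, *, idempotency_key: str | None = None):
    """Retry `call` with exponential backoff and full jitter.

    The same idempotency key is sent on every attempt so that a retry of a
    request that actually succeeded does not produce duplicate work downstream.
    """
    key = idempotency_key or str(uuid.uuid4())
    last_error: Exception | None = None
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call(idempotency_key=key)
        except RetryableError as exc:
            last_error = exc
            if attempt == MAX_ATTEMPTS - 1:
                break
            delay = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
            time.sleep(random.uniform(0.0, delay))   # full jitter
    raise last_error
```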
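Finally, a sketch of per-tenant token-bucket rate limiting. The per-tenant rate and burst numbers are assumptions, and in a real deployment the buckets would live in shared storage (e.g., Redis) rather than process memory so that all gateway replicas observe the same counts.

```python
import time

class TokenBucket:
    """Classic token bucket: `rate` tokens added per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity            # start full so initial bursts are allowed
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per tenant; the 50 req/s rate and burst of 100 are assumed quotas.
buckets: dict[str, TokenBucket] = {}

def check_rate_limit(tenant_id: str) -> bool:
    bucket = buckets.setdefault(tenant_id, TokenBucket(rate=50.0, capacity=100.0))
    return bucket.allow()
```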