Design an ML inference orchestration platform

Q: Design an ML inference orchestration platform

This question evaluates system-design and ML infrastructure competencies, focusing on orchestration of multi-model inference workflows, distributed data flow and storage, versioning and model management, scaling and deployment, failure handling, observability, and multi-tenant security and quota management.

Q: How do I approach ML System Design interview questions?

ML system design questions reward a solid grasp of core concepts and deliberate practice. PracHub provides solutions with explanations to help you master ML system design interviews.

Q: What difficulty level is this interview question?

This is a hard-difficulty ML System Design question, commonly asked during onsite rounds at Palo Alto Networks.

Q: What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Palo Alto Networks during technical interviews.

Question

System Design: ML Inference Orchestration Platform

Context

You are designing a multi-tenant platform that exposes several ML models as independent services (for example: text classification, embeddings generation, and re-ranking). External clients should be able to invoke end-to-end workflows that chain these models. The platform must support both low-latency synchronous requests and higher-latency asynchronous jobs.

Assume:

- External clients send requests with an input document (text or metadata), a chosen workflow, and optional parameters (model versions, thresholds, etc.).
- Individual model services are independently deployed and versioned.
- Workflows may include branching and parallel steps (e.g., compute embeddings while running classification, then re-rank results).

Task

Design the orchestration platform and address the following:

Request Routing and Orchestration
- How do external requests arrive and get routed to workflow execution?
- How are workflows represented (DAG/state machine) and executed?
- Synchronous vs. asynchronous execution.
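
One possible way to represent and execute a workflow as a DAG is sketched below, using the example from the context (classification and embeddings in parallel, then re-ranking). This is a minimal illustration, not a prescribed answer: `WORKFLOW`, `run_step`, and `run_workflow` are hypothetical names, and `run_step` is a stand-in for a real call to a model service.

```python
import asyncio
from graphlib import TopologicalSorter

# Hypothetical workflow definition: step name -> set of steps it depends on.
WORKFLOW = {
    "classify": set(),
    "embed": set(),
    "rerank": {"classify", "embed"},
}


async def run_step(name: str, document: str, upstream: dict[str, str]) -> str:
    """Stand-in for an RPC/HTTP call to a model service (assumption)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    deps = ",".join(sorted(upstream)) or "input"
    return f"{name}<-{deps}"


async def run_workflow(dag: dict[str, set[str]], document: str) -> dict[str, str]:
    """Run steps in dependency order; steps that are ready together run concurrently."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the workflow is not a DAG
    results: dict[str, str] = {}
    while sorter.is_active():
        ready = list(sorter.get_ready())
        outputs = await asyncio.gather(
            *(run_step(s, document, {d: results[d] for d in dag[s]}) for s in ready)
        )
        for step, output in zip(ready, outputs):
            results[step] = output
            sorter.done(step)  # unblocks steps that depend on this one
    return results


if __name__ == "__main__":
    print(asyncio.run(run_workflow(WORKFLOW, "example document")))
```

This sketch executes the DAG in waves: every step in a wave must finish before the next wave starts. A production orchestrator would more likely dispatch each step as soon as its own dependencies complete, and would persist step state so that asynchronous jobs survive restarts.
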
Data Flow Between Models
- How are inputs/outputs passed between steps (in-memory vs. references)?
- Handling large payloads and parallel branches.
Schemas and Validation
- Define request/response schemas and validation strategy across services.
- Versioning and backward compatibility.
Intermediate Storage and Caching
- Where to store intermediate artifacts and how to cache reusable results (e.g., embeddings by content hash)?
- TTLs and invalidation.
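
As a rough illustration of caching by content hash with a TTL, the sketch below keys each entry on tenant, model version, and a SHA-256 digest of the input text. It is an in-memory stand-in for a shared cache; the class name `EmbeddingCache`, the key layout, and the choice to scope keys per tenant are assumptions for the example rather than requirements from the question.

```python
import hashlib
import time


class EmbeddingCache:
    """In-memory TTL cache keyed by content hash (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested deterministically
        self._entries = {}   # key -> (expires_at, value)

    @staticmethod
    def key(tenant_id: str, model_version: str, text: str) -> str:
        # Including the model version means a new version never reads stale
        # embeddings; including the tenant keeps entries isolated (assumption).
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{tenant_id}:{model_version}:{digest}"

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazy expiry on read
            return None
        return value

    def put(self, key: str, value) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
```

Because the model version is part of the key, rolling out a new embeddings model invalidates old entries implicitly; the TTL then only has to bound storage growth.
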
Final Results Storage
- How to persist final outputs for retrieval, analytics, and audit.
- Retention and multi-tenant partitioning.
Model and Version Management
- Model registry, version pinning, canary/AB testing, and rollback.
Failure Handling and Retries
- Timeouts, retries with backoff/jitter, idempotency, partial failures, and fallbacks.
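
The retry policy named above can be sketched as exponential backoff with full jitter. The names `TransientError`, `backoff_delays`, and `call_with_retries`, and the default values, are hypothetical; the sketch also assumes the wrapped call is idempotent (for example, it carries an idempotency key), since retrying a non-idempotent step can duplicate side effects.

```python
import random
import time


class TransientError(Exception):
    """Raised by a step for failures that are worth retrying (assumption)."""


def backoff_delays(max_attempts: int, base: float = 0.1, cap: float = 5.0,
                   rng=random.random):
    """Yield one full-jitter delay before each retry: uniform in [0, min(cap, base * 2^n))."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(fn, max_attempts: int = 4, sleep=time.sleep, **backoff):
    """Call fn up to max_attempts times, sleeping with jittered backoff between tries."""
    delays = backoff_delays(max_attempts, **backoff)
    while True:
        try:
            return fn()
        except TransientError:
            delay = next(delays, None)
            if delay is None:  # attempts exhausted: surface the last error
                raise
            sleep(delay)
```

Only errors classified as transient are retried here; validation failures and other permanent errors propagate immediately, which is where a fallback or partial-result path would take over.
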
Monitoring and Traceability
- Metrics, logs, and distributed tracing across the workflow.
- Per-tenant visibility and cost/QPS attribution.
Scaling and Deployment
- Autoscaling model services (CPU/GPU), concurrency controls, warm pools.
- Multi-region, high availability, and deployment strategies.
Multi-tenant Auth and Rate Limiting
- Authentication/authorization, quotas, rate limiting, and isolation.
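
Per-tenant rate limiting is often implemented with a token bucket; the sketch below shows one in-process version under that assumption. `TokenBucket` and `TenantRateLimiter` are hypothetical names, and a real multi-instance gateway would need shared state (for example, in a distributed store) rather than a local dictionary.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so short bursts are allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill based on elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """One bucket per tenant, so a noisy tenant cannot exhaust another's quota."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow()
```

The `cost` parameter leaves room for weighting requests, for example charging a GPU-backed workflow more tokens than a lightweight classification call.
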

RPC vs. REST Between Services
- Justify when to use RPC versus REST for internal calls.
- Discuss trade-offs in latency, schema evolution, observability, and backward compatibility.

Be explicit about assumptions and provide rationales for key choices.
