Design a batch inference API

Q: Design a batch inference API

This question evaluates a candidate's ability to design asynchronous batch inference APIs and systems, including API schema and job lifecycle design, idempotency semantics, queueing and worker scaling, batching and accelerator utilization, rate limiting, observability, and error handling.

Question

System Design: Async Inference Service API (POST Job, Poll for Results)

Context

You are designing an asynchronous inference service where clients submit a single item or a batch of items for model inference. The service should immediately acknowledge submission with a job ID and allow clients to poll for status and results later. No streaming of results is required.
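The submit-then-poll contract described above can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: `InMemoryJobStore`, `Job`, and `JobState` are hypothetical names, the dictionary stands in for a real queue and job database, and `run` is called directly where a real service would use a separate worker process.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    items: list
    state: JobState = JobState.QUEUED
    results: list = field(default_factory=list)


class InMemoryJobStore:
    """Hypothetical stand-in for the real queue plus job database."""

    def __init__(self):
        self._jobs = {}

    def submit(self, items):
        # Accept a single item or a batch; normalize to a list.
        if not isinstance(items, list):
            items = [items]
        job = Job(job_id=str(uuid.uuid4()), items=items)
        self._jobs[job.job_id] = job
        # The ID is returned immediately, before any inference runs.
        return job.job_id

    def run(self, job_id, model):
        # Worker side: move the job through running to a terminal state.
        job = self._jobs[job_id]
        job.state = JobState.RUNNING
        try:
            job.results = [model(x) for x in job.items]
            job.state = JobState.SUCCEEDED
        except Exception:
            job.state = JobState.FAILED

    def status(self, job_id):
        # Polling endpoint: results are attached only once the job succeeded.
        job = self._jobs[job_id]
        body = {"job_id": job.job_id, "state": job.state.value}
        if job.state is JobState.SUCCEEDED:
            body["results"] = job.results
        return body
```

A client would call `submit`, keep the returned ID, and call `status` repeatedly (ideally with backoff) until the state is `succeeded` or `failed`.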

Requirements

API behavior
- Accept single or batch inputs.
- On submission, return a job ID immediately.
- Provide status endpoints with states: queued, running, succeeded, failed.
- No streaming response is required (polling only).
API design
- Specify request/response schemas for submission, status, and results.
- Include idempotency keys and semantics.
- Define timeout and retry behavior (client and server side).
- Define rate limits and backpressure behavior.
Architecture
- Describe the job queue, workers, and storage for inputs, intermediate data, and final results.
- Explain how to scale workers, batch efficiently, and utilize accelerators (e.g., GPUs).
Operability
- Provide observability (metrics, logs, tracing).
- Define error handling and a standardized error schema.
- Handle partial failures within a batch.
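One possible reading of the idempotency requirement is sketched below. The semantics are an assumption, not part of the question: a repeated key with the same payload returns the original job ID without creating a new job, and a repeated key with a different payload is rejected. `IdempotentSubmitter` and `IdempotencyConflict` are hypothetical names, and the dictionary stands in for a durable store with a key expiry policy.

```python
import hashlib
import json
import uuid


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different payload."""


class IdempotentSubmitter:
    def __init__(self):
        # idempotency key -> (payload fingerprint, job_id)
        self._by_key = {}

    @staticmethod
    def _fingerprint(payload):
        # Canonical JSON so that key order does not change the hash.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def submit(self, idempotency_key, payload):
        """Return (job_id, created). created is False on a replay."""
        fingerprint = self._fingerprint(payload)
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            stored_fingerprint, job_id = existing
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict(idempotency_key)
            return job_id, False
        job_id = str(uuid.uuid4())
        self._by_key[idempotency_key] = (fingerprint, job_id)
        return job_id, True
```

With this shape, a client that times out on submission can safely retry with the same key and will receive the same job ID instead of enqueueing duplicate work.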

Assume a typical cloud environment and standard components are available (HTTP gateway, object storage, message queues, autoscaling, etc.).
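For the partial-failure and error-schema requirements, one option is to record an outcome per item and derive the job state from those outcomes. The sketch below assumes a policy that the question does not specify: a job is `failed` only when every item fails, and otherwise `succeeded` with per-item errors. The function `run_batch`, the field names, and the `INFERENCE_ERROR` code are all hypothetical.

```python
def run_batch(items, model):
    """Run each item independently so one bad input does not fail the batch."""
    results = []
    for index, item in enumerate(items):
        try:
            output = model(item)
            results.append({"index": index, "status": "succeeded", "output": output})
        except Exception as exc:
            # Standardized error object; the fields are illustrative assumptions.
            results.append({
                "index": index,
                "status": "failed",
                "error": {
                    "code": "INFERENCE_ERROR",
                    "message": str(exc),
                    "retryable": False,
                },
            })

    failed = sum(1 for r in results if r["status"] == "failed")
    # Assumed policy: only an all-failed, non-empty batch marks the job failed.
    all_failed = bool(results) and failed == len(results)
    return {
        "state": "failed" if all_failed else "succeeded",
        "succeeded_count": len(results) - failed,
        "failed_count": failed,
        "items": results,
    }
```

Keeping the original `index` on every entry lets clients match outputs and errors back to their inputs and resubmit only the failed items.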
