This question evaluates a candidate's ability to design asynchronous batch inference APIs and systems, including API schema and job lifecycle design, idempotency semantics, queueing and worker scaling, batching and accelerator utilization, rate limiting, observability, and error handling.

You are designing an asynchronous inference service where clients submit a single item or a batch of items for model inference. The service should immediately acknowledge submission with a job ID and allow clients to poll for status and results later. No streaming of results is required.
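To make the contract concrete, below is a minimal Python sketch of one possible payload shape and job lifecycle. The names (`SubmitRequest`, `JobRecord`, `JobState`) and fields are illustrative assumptions, not a prescribed schema.

```python
# A minimal sketch of the job contract. All names are illustrative
# assumptions, not a required API schema.
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class JobState(str, Enum):
    PENDING = "pending"      # accepted, waiting in the queue
    RUNNING = "running"      # picked up by a worker
    SUCCEEDED = "succeeded"  # results available for retrieval
    FAILED = "failed"        # terminal error; see `error`


@dataclass
class SubmitRequest:
    items: list[Any]                       # one or more inference inputs
    idempotency_key: Optional[str] = None  # client-supplied dedup key


@dataclass
class JobRecord:
    job_id: str
    state: JobState
    results: Optional[list[Any]] = None    # populated on success
    error: Optional[str] = None            # populated on failure


def submit(req: SubmitRequest) -> JobRecord:
    """Acknowledge immediately with a job ID; the actual work is enqueued."""
    job = JobRecord(job_id=str(uuid.uuid4()), state=JobState.PENDING)
    # enqueue(job.job_id, req.items)  # hand off to the queue/worker tier
    return job
```

A client would call the submit endpoint once, store the returned `job_id`, and then poll a status endpoint that returns the `JobRecord` until `state` reaches a terminal value.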
Assume a typical cloud environment in which standard components (HTTP gateway, object storage, message queues, autoscaling, etc.) are available.
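Inside the worker tier, a common pattern is to drain the queue into micro-batches so the accelerator runs full forward passes rather than single items. The sketch below is one hedged way to do this; it uses an in-process `queue.Queue` as a stand-in for a managed message queue, and `MAX_BATCH`, `MAX_WAIT_S`, and `infer` are illustrative assumptions to be tuned per model.

```python
# A sketch of a batching worker loop. queue.Queue stands in for a
# managed message queue; constants and the infer callable are assumed.
import queue
import time
from typing import Any, Callable

MAX_BATCH = 32     # illustrative cap, tuned to the model/accelerator
MAX_WAIT_S = 0.05  # flush a partial batch after this long


def batch_loop(q: "queue.Queue[Any]",
               infer: Callable[[list[Any]], list[Any]]) -> None:
    while True:
        batch = [q.get()]  # block until at least one item arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(q.get(timeout=remaining))
            except queue.Empty:
                break
        results = infer(batch)  # one forward pass per micro-batch
        # persist results and mark each job SUCCEEDED / FAILED here
```

The deadline bounds the extra latency a partial batch can incur, trading a short wait for higher accelerator utilization.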