This question evaluates a candidate's ability to design asynchronous batch inference APIs and systems, including API schema and job lifecycle design, idempotency semantics, queueing and worker scaling, batching and accelerator utilization, rate limiting, observability, and error handling.
Design an inference service API where clients POST a job and later poll for results. Requirements: accept single or batch inputs; return a job ID on submission; provide status endpoints (queued, running, succeeded, failed); no streaming required. Specify request/response schemas, idempotency keys, timeout and retry behavior, and rate limits. Describe the job queue, workers, and storage of intermediate and final results; how you would scale workers, batch efficiently, and utilize accelerators; and how you would implement observability, error handling, and partial failures within a batch.
Design an Asynchronous (POST-and-Poll) Inference Service API
Design an asynchronous inference service for serving model predictions. A client submits a single item or a batch of items in one request, immediately receives a job ID, and later polls for status and results. There is no streaming of results back to the client and no synchronous result on submission — submission only acknowledges the work and hands back a handle to track it.
Walk through the API contract, the backing architecture, and how you would operate the service in production. Assume a typical cloud environment with standard building blocks available (HTTP gateway, object storage, message queues, autoscaling, etc.).
Constraints & Assumptions
These anchor the design; treat any number you'd want firmed up as a clarifying question rather than a hard SLA.
Workload: GPU-served models (e.g., language or vision), where one inference is the expensive step (tens of milliseconds to seconds per item) rather than the I/O.
Async is intentional: clients tolerate end-to-end latency from seconds to minutes, which is what makes queuing, batching, and autoscaling worthwhile.
Multi-tenant: many API keys share the fleet; one tenant must not starve or crash another.
Batch sizes: a single item up to roughly 10^4 items per job; larger workloads are chunked or referenced by file.
Result durability: results are retained for a bounded TTL (e.g., 7 days), then purged.
Standard cloud primitives are available; you do not need to build a queue or object store from scratch.
Clarifying Questions to Ask
A strong candidate scopes the problem before designing. Good questions here include:
What is the read:write ratio? (Polling reads vastly outnumber submissions, which shapes caching and rate limits.)
What latency SLA matters: time-to-acknowledge on submit, or end-to-end time-to-result? What batch sizes dominate in practice?
Are models single-pass, or are there multi-stage pipelines with intermediate artifacts to persist?
What are the payload sizes (per item and per batch), and the retention requirements for inputs and results?
What delivery guarantee is required: is at-least-once processing with idempotent results acceptable, or is exactly-once mandatory?
What is the peak request rate and the tenant isolation requirement (noisy-neighbor tolerance)?
Address each of the four parts below.
Part 1 — API behavior
Accept single or batch inputs through the same submission endpoint.
Return a job ID immediately on submission.
Expose status endpoints that report one of four states: queued, running, succeeded, failed.
Polling only: no streaming response is required.
State precisely what each of the four statuses means, including how they behave for a mixed batch where some items succeed and others fail.
What This Part Should Cover
Unified contract: a single submit path for one item or many, returning a job_id immediately (the submit is an acknowledgement, not a result).
Precise state definitions: what each of queued | running | succeeded | failed means and the exact transition that makes a job terminal.
Partial-failure semantics: a defensible rule for a mixed batch that neither discards good results nor hides failures (e.g., an extra job-summary signal beyond the four states).
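One way to make the state definitions and the mixed-batch rule concrete is a small sketch like the following. The transition table, the `partial_failure` flag, and the rule that a job counts as succeeded when at least one item succeeded are assumptions chosen for illustration, not something the prompt prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Assumed transition table: succeeded and failed are terminal.
# queued -> failed covers jobs rejected or expired before any worker picks them up.
TRANSITIONS = {
    JobStatus.QUEUED: {JobStatus.RUNNING, JobStatus.FAILED},
    JobStatus.RUNNING: {JobStatus.SUCCEEDED, JobStatus.FAILED},
    JobStatus.SUCCEEDED: set(),
    JobStatus.FAILED: set(),
}


@dataclass
class ItemCounts:
    total: int
    succeeded: int = 0
    failed: int = 0

    @property
    def pending(self) -> int:
        return self.total - self.succeeded - self.failed


def can_transition(src: JobStatus, dst: JobStatus) -> bool:
    return dst in TRANSITIONS[src]


def summarize(counts: ItemCounts, started: bool) -> dict:
    """Derive the job status and a summary signal from per-item counts."""
    if counts.pending > 0:
        status = JobStatus.RUNNING if started else JobStatus.QUEUED
    elif counts.succeeded > 0:
        # Assumed rule: any successful item makes the job "succeeded";
        # the partial_failure flag keeps the failures visible.
        status = JobStatus.SUCCEEDED
    else:
        status = JobStatus.FAILED
    return {
        "status": status.value,
        "partial_failure": status is JobStatus.SUCCEEDED and counts.failed > 0,
        "counts": {
            "total": counts.total,
            "succeeded": counts.succeeded,
            "failed": counts.failed,
            "pending": counts.pending,
        },
    }
```

The opposite rule (any failed item fails the job) is equally defensible; what matters is that the rule is stated and the counts are always returned alongside the status.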
Part 2 — API design
Specify the request/response schemas for submission, status, and results.
Define idempotency keys and their semantics (so a retried submission doesn't create duplicate work).
Define timeout and retry behavior on both the client and server side.
Define rate limits and backpressure behavior.
What This Part Should Cover
Concrete schemas: well-shaped request/response bodies for submit, status, per-item results, and a consolidated result, with the status read kept deliberately small.
Exactly-once job creation: idempotency keyed on caller + key, race-safe under concurrent retries, with a defined behavior when the same key arrives with a different payload.
Retry contract on both sides: which status codes are retryable, client backoff with the same idempotency key, and server-side per-item / per-job timeouts.
Two distinct overload signals: per-key rate limiting vs system-wide backpressure, each with a different response code so the client knows whether to slow itself or simply wait.
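The exactly-once job creation point can be illustrated with a minimal in-memory sketch. `JobStore`, `IdempotencyConflict`, and the choice to reject a reused key with a different payload are hypothetical names and decisions for illustration; in a real service the lock would typically be replaced by a unique constraint on (caller, key) in the metadata store.

```python
import hashlib
import json
import threading
import uuid
from typing import Dict, Tuple


class IdempotencyConflict(Exception):
    """Same caller + key reused with a different payload (assumed to map to a 4xx response)."""


class JobStore:
    def __init__(self) -> None:
        # The lock stands in for a database unique constraint on (api_key, idempotency_key).
        self._lock = threading.Lock()
        self._by_key: Dict[Tuple[str, str], Tuple[str, str]] = {}

    @staticmethod
    def _fingerprint(payload: dict) -> str:
        # Canonical JSON so that key order does not change the fingerprint.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def submit(self, api_key: str, idempotency_key: str, payload: dict) -> Tuple[str, bool]:
        """Return (job_id, created); created is False when the submission is a replay."""
        fingerprint = self._fingerprint(payload)
        with self._lock:
            existing = self._by_key.get((api_key, idempotency_key))
            if existing is not None:
                job_id, stored_fingerprint = existing
                if stored_fingerprint != fingerprint:
                    raise IdempotencyConflict(idempotency_key)
                return job_id, False
            job_id = "job_" + uuid.uuid4().hex
            self._by_key[(api_key, idempotency_key)] = (job_id, fingerprint)
            return job_id, True
```

A retried submit with the same key and payload gets the original `job_id` back, so a client can safely retry after a timeout without knowing whether the first attempt landed.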
Part 3 — Architecture
Describe the job queue, the workers, and the storage of inputs, intermediate artifacts, and final results.
Explain how you would scale workers, batch efficiently, and utilize accelerators (e.g., GPUs).
What This Part Should Cover
Decoupled topology: stateless CPU ingress, a durable queue, and GPU workers that scale independently, with the at-least-once delivery semantics that implies.
Storage split: large blobs (inputs, intermediates, results) in object storage; only pointers and small metadata in the DB; a clear home for multi-stage intermediates.
Accelerator utilization: dynamic/micro-batching with a fire trigger, shape bucketing to bound padding waste, continuous batching for generative decoding, and warm weights.
Scaling signal: the autoscaler watches a leading indicator (queue depth / target wait) rather than only GPU utilization, plus priority lanes and per-model pools.
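Micro-batching with a fire trigger and shape bucketing might look roughly like the sketch below. The bucket boundaries, `max_batch_size`, and `max_wait_s` values are illustrative assumptions, and the clock is passed in explicitly so the behavior is easy to reason about; a real worker would call `poll` from its serving loop.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


def bucket_for(length: int, buckets: Tuple[int, ...] = (32, 128, 512, 2048)) -> int:
    """Smallest bucket that fits the input, so padding never exceeds the bucket size."""
    for bucket in buckets:
        if length <= bucket:
            return bucket
    raise ValueError(f"input length {length} exceeds largest bucket {buckets[-1]}")


@dataclass
class MicroBatcher:
    max_batch_size: int = 8     # fire when a bucket holds this many items
    max_wait_s: float = 0.02    # or when its oldest item has waited this long
    _pending: Dict[int, List[Tuple[float, str]]] = field(
        init=False, default_factory=lambda: defaultdict(list)
    )

    def add(self, item_id: str, length: int, now: float) -> None:
        self._pending[bucket_for(length)].append((now, item_id))

    def poll(self, now: float) -> Optional[Tuple[int, List[str]]]:
        """Return (bucket, item_ids) for the first bucket whose trigger fired, else None."""
        for bucket, queue in self._pending.items():
            if not queue:
                continue
            full = len(queue) >= self.max_batch_size
            stale = now - queue[0][0] >= self.max_wait_s
            if full or stale:
                batch = queue[: self.max_batch_size]
                self._pending[bucket] = queue[self.max_batch_size:]
                return bucket, [item_id for _, item_id in batch]
        return None
```

The two triggers trade throughput against latency: a larger `max_batch_size` fills the accelerator better, while `max_wait_s` caps how long a lone item can sit waiting for company.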
Part 4 — Operability
Implement observability: metrics, logs, and tracing.
Define error handling and a standardized error schema.
Handle partial failures within a batch (some items succeed while others fail).
What This Part Should Cover
Actionable metrics: latency percentiles plus efficiency signals (batch-size distribution, GPU utilization) that reveal whether batching works and whether GPUs are over- or under-provisioned.
One error envelope: a stable machine-readable code and retryable flag everywhere, so clients branch on fields not prose.
Transient vs permanent handling: retry-with-backoff for transient errors, immediate fail + DLQ for permanent ones.
Partial-failure roll-up in practice: counts the client can act on, a way to list only failed items, and a consolidated artifact containing both successes and failures.
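A sketch of a single error envelope, the transient-versus-permanent decision, and a roll-up that lists only failed items could look like this. The error codes, the `max_attempts` default, the backoff constants, and sending exhausted transient errors to the DLQ are all assumptions for illustration.

```python
from dataclasses import asdict, dataclass
from typing import List

# Hypothetical codes: transient errors are retried, everything else is permanent.
RETRYABLE_CODES = {"WORKER_TIMEOUT", "WORKER_PREEMPTED", "UPSTREAM_UNAVAILABLE"}


@dataclass(frozen=True)
class ErrorEnvelope:
    code: str        # stable, machine-readable
    message: str     # human-readable, not for branching
    retryable: bool


def make_error(code: str, message: str) -> ErrorEnvelope:
    return ErrorEnvelope(code=code, message=message, retryable=code in RETRYABLE_CODES)


def next_action(error: ErrorEnvelope, attempt: int, max_attempts: int = 3) -> str:
    """Retry transient errors with attempts left; otherwise dead-letter the item."""
    if error.retryable and attempt < max_attempts:
        return "retry"
    return "dead_letter"


def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Capped exponential backoff; jitter is omitted to keep the sketch deterministic."""
    return min(cap, base * 2 ** (attempt - 1))


def roll_up(items: List[dict]) -> dict:
    """Summarize a batch: counts plus only the failed items with their envelopes."""
    failed = [item for item in items if item.get("error") is not None]
    return {
        "counts": {
            "total": len(items),
            "succeeded": len(items) - len(failed),
            "failed": len(failed),
        },
        "failed_items": [
            {"item_id": item["item_id"], "error": asdict(item["error"])} for item in failed
        ],
    }
```

Because every failure carries the same three fields, a client can decide whether to resubmit only the failed items by checking `retryable` rather than parsing messages.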
What a Strong Answer Covers
These cross-cutting dimensions span all four parts; an interviewer weighs them across the whole design (these are dimensions, not the answers):
Capacity reasoning: a back-of-envelope estimate (items/sec, GPU-seconds per item, payload sizes) that justifies the architecture rather than reciting components.
Correctness under retries end-to-end: exactly-once job creation via idempotency (Part 2) combined with at-least-once processing made safe by idempotent per-item write-back (Part 3); the two halves must compose.
One coherent contract: the API shapes (Part 2) match the state and partial-failure semantics (Parts 1 and 4) and the storage/result paths (Part 3) without contradiction.
Tradeoffs and isolation: naming the cost of each major decision and when to revisit it, plus multi-tenant fairness and security (TLS, signed URLs, least privilege) that run through every part.
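The capacity-reasoning dimension can be sketched as a few lines of arithmetic. Every input below (request rate, GPU-seconds per item, batching speedup, target utilization, result size) is a made-up placeholder; only the 7-day TTL echoes the example given in the constraints.

```python
import math


def gpus_needed(
    items_per_sec: float,
    gpu_sec_per_item: float,
    batch_speedup: float = 1.0,
    target_utilization: float = 0.7,
) -> int:
    """GPUs required to keep up with the arrival rate at the target utilization."""
    busy_gpu_seconds_per_sec = items_per_sec * gpu_sec_per_item / batch_speedup
    return math.ceil(busy_gpu_seconds_per_sec / target_utilization)


def retained_result_bytes(items_per_sec: int, bytes_per_result: int, ttl_days: int) -> int:
    """Steady-state result storage when every result is kept for ttl_days."""
    return items_per_sec * bytes_per_result * ttl_days * 86_400


# Placeholder workload: 1,000 items/sec, 50 ms of GPU time each, 4x gain from batching.
print(gpus_needed(1000, 0.05, batch_speedup=4.0))   # 18
print(retained_result_bytes(1000, 2_000, 7))        # 1209600000000 (about 1.2 TB)
```

Numbers like these are what justify the storage split and the decision to scale GPU workers separately from ingress.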
Follow-up Questions
How does the design change at 100x scale (e.g., 100k items/sec)? What breaks first: ingress, the queue, the metadata DB, or GPU supply?
A client submits a 10k-item job and polls aggressively. How do you keep status reads cheap and return results efficiently without paging through 10k items?
A worker crashes after running inference but before acknowledging the message. Walk through exactly what happens and why no result is double-counted or lost.
One model version is repointed mid-flight. How do in-flight jobs and later polls stay consistent about which model actually ran?
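For the worker-crash follow-up, one possible line of reasoning is sketched below: results are written with put-if-absent semantics keyed on (job, item), and progress is derived from the stored keys rather than from a counter that is incremented on each delivery. `ResultStore` and `handle_message` are hypothetical names, and the crash is simulated with a flag.

```python
from typing import Callable, Dict, Tuple


class ResultStore:
    def __init__(self) -> None:
        self._results: Dict[Tuple[str, int], str] = {}

    def put_if_absent(self, job_id: str, item_index: int, output: str) -> bool:
        """Write once per (job, item); a redelivered message becomes a no-op."""
        key = (job_id, item_index)
        if key in self._results:
            return False
        self._results[key] = output
        return True

    def completed(self, job_id: str) -> int:
        # Progress is counted from stored keys, so duplicates cannot inflate it.
        return sum(1 for (stored_job, _) in self._results if stored_job == job_id)


def handle_message(
    store: ResultStore,
    job_id: str,
    item_index: int,
    infer: Callable[[int], str],
    crash_before_ack: bool = False,
) -> bool:
    """Process one queue message; return True only if it was acknowledged."""
    store.put_if_absent(job_id, item_index, infer(item_index))
    if crash_before_ack:
        return False  # no ack: the queue redelivers the message later
    return True


store = ResultStore()
acked = handle_message(store, "job_1", 0, lambda i: f"out-{i}", crash_before_ack=True)
redelivered = handle_message(store, "job_1", 0, lambda i: f"out-{i}")
print(acked, redelivered, store.completed("job_1"))  # False True 1
```

The redelivery costs a second inference in this sketch; checking for an existing result before running the model would avoid that at the price of an extra read.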