System Design: Background Processing Backend for LLM Prompts
Context
Design a multi-tenant backend that processes large language model (LLM) prompts asynchronously. Clients submit prompts via an API and later poll for status and results, or receive results via webhook callbacks. The system must be reliable, scale with load, and enforce cost controls.
Requirements
- APIs: submit prompts (with idempotency keys), poll job status, fetch results, register webhooks/callbacks. (Sketch below.)
- Job orchestration: queueing, prioritization (e.g., realtime vs. bulk), worker pools, retries, dead-letter queues (DLQ). (Sketch below.)
- Model routing: route requests to an appropriate model/provider based on policy (latency/cost/quality/capacity). (Sketch below.)
- Prompt versioning: manage template versions and record the exact prompt/model context used, for reproducibility. (Sketch below.)
- Idempotency: ensure duplicate submissions do not create duplicate work or charges. (Sketch below.)
- Retries and DLQ: automatic retry with backoff; poison-message handling. (Sketch below.)
- Result storage: store inputs, outputs, and metadata; enable polling and callback delivery; set retention policies. (Sketch below.)
- Observability: metrics, logs, traces; per-tenant dashboards, alerting, audits. (Sketch below.)
- Non-functionals: scaling and capacity planning, cost control, rate limiting, PII/security, and SLAs/SLOs. (Rate-limiting sketch below.)
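A minimal sketch of the API surface, assuming JSON over HTTP. The paths, field names, and job states below are assumptions, not a fixed contract:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


@dataclass
class SubmitRequest:
    tenant_id: str
    prompt_template_id: str           # versioned template reference
    variables: dict                   # values bound into the template
    idempotency_key: str              # client-supplied; dedupes retried submits
    priority: str = "bulk"            # "realtime" | "bulk"
    webhook_url: Optional[str] = None


@dataclass
class JobStatusResponse:
    job_id: str
    state: JobState
    result_uri: Optional[str] = None  # populated once the job succeeds
    error: Optional[str] = None


# Route table (illustrative):
#   POST /v1/jobs                  -> submit; 202 Accepted with job_id
#   GET  /v1/jobs/{job_id}         -> poll status
#   GET  /v1/jobs/{job_id}/result  -> fetch result (404 until ready)
#   POST /v1/jobs/{job_id}/cancel  -> request cancellation (see follow-up)
#   POST /v1/webhooks              -> register a callback endpoint
```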
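For orchestration, one option is strict two-queue priority in each worker: drain realtime work first, then fall back to bulk. The in-process queues below stand in for a real broker (e.g., SQS or RabbitMQ), and `process`/`requeue_or_dead_letter` are placeholders:

```python
import queue
import threading

realtime_q: "queue.Queue[dict]" = queue.Queue()
bulk_q: "queue.Queue[dict]" = queue.Queue()


def process(job: dict) -> None:
    """Placeholder for the actual model-provider call."""
    print("processing", job)


def requeue_or_dead_letter(job: dict) -> None:
    """Placeholder; see the retry/DLQ sketch below."""
    print("failed", job)


def next_job(timeout: float = 1.0):
    # Strict priority: take realtime work first, then bulk.
    try:
        return realtime_q.get_nowait()
    except queue.Empty:
        pass
    try:
        return bulk_q.get(timeout=timeout)
    except queue.Empty:
        return None


def worker_loop(stop: threading.Event) -> None:
    # One such loop runs per worker; pool size is a capacity knob.
    while not stop.is_set():
        job = next_job()
        if job is None:
            continue
        try:
            process(job)
        except Exception:
            requeue_or_dead_letter(job)
```

Strict priority can starve bulk traffic under sustained realtime load; a weighted or deficit-round-robin draining policy avoids that at the cost of slightly higher realtime latency.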
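A policy-driven router can filter candidates by availability and a quality floor, then rank by the dimension the priority class cares about. The model names, prices, and latencies here are invented:

```python
from dataclasses import dataclass


@dataclass
class ModelTarget:
    name: str
    cost_per_1k_tokens: float
    p95_latency_ms: int
    quality_tier: int          # higher is better
    available: bool = True     # flipped off on provider outage/backpressure


# Illustrative candidates; real values would come from config + health checks.
CANDIDATES = [
    ModelTarget("provider-a/large", 0.030, 1800, 3),
    ModelTarget("provider-a/small", 0.002, 400, 1),
    ModelTarget("provider-b/medium", 0.010, 900, 2),
]


def route(priority: str, min_quality: int) -> ModelTarget:
    eligible = [m for m in CANDIDATES
                if m.available and m.quality_tier >= min_quality]
    if not eligible:
        raise RuntimeError("no capacity satisfies the routing policy")
    if priority == "realtime":
        # Realtime jobs optimize for latency.
        return min(eligible, key=lambda m: m.p95_latency_ms)
    # Bulk jobs optimize for cost.
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)
```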
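For reproducibility, one approach is to publish templates as immutable versions and persist, per job, the pinned version plus a hash of the exact rendered prompt. Field names are assumptions:

```python
import hashlib


def render_and_pin(template_id: str, version: int,
                   template: str, variables: dict) -> dict:
    # Simple substitution stands in for a real templating engine.
    rendered = template.format(**variables)
    return {
        "template_id": template_id,
        "template_version": version,  # immutable once published
        "variables": variables,
        "rendered_prompt": rendered,
        # Hash lets you verify the stored prompt on replay/audit.
        "prompt_sha256": hashlib.sha256(rendered.encode()).hexdigest(),
    }
```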
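Idempotency can hinge on a uniqueness guarantee over (tenant_id, idempotency_key). The in-memory map below only illustrates the contract; production would use a database unique index or an atomic Redis `SET NX` with a TTL, since the check-then-set here is racy across processes:

```python
# Stand-in for a unique (tenant_id, idempotency_key) constraint.
_seen: dict[tuple[str, str], str] = {}


def submit_once(tenant_id: str, idem_key: str, create_job) -> str:
    """Return the existing job_id for a duplicate submit; otherwise create.

    `create_job` is a hypothetical zero-arg callable that enqueues the job
    and returns its id.
    """
    key = (tenant_id, idem_key)
    if key in _seen:
        return _seen[key]   # same job_id back; no duplicate work or charge
    job_id = create_job()
    _seen[key] = job_id
    return job_id
```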
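Retries typically use capped exponential backoff with full jitter, and route poison messages to the DLQ once an attempt budget is exhausted. The budget and delays below are illustrative:

```python
import random
import time

MAX_ATTEMPTS = 5
BASE_DELAY_S = 1.0


def backoff_delay(attempt: int) -> float:
    """Full-jitter exponential backoff, capped at 60 seconds."""
    return random.uniform(0.0, min(60.0, BASE_DELAY_S * 2 ** attempt))


def run_with_retries(job: dict, call, dead_letter) -> None:
    """`call` does the work; `dead_letter` is a hypothetical DLQ publisher."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            call(job)
            return
        except Exception as exc:
            if attempt == MAX_ATTEMPTS - 1:
                # Attempt budget exhausted: treat as a poison message.
                dead_letter(job, reason=str(exc))
                return
            time.sleep(backoff_delay(attempt))
```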
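A possible result-store shape, written as SQLite DDL for brevity; a production system would likely keep large outputs in blob storage and persist only pointers. Column names are assumptions:

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS job_results (
    job_id        TEXT PRIMARY KEY,
    tenant_id     TEXT NOT NULL,
    prompt_sha256 TEXT NOT NULL,  -- links back to the pinned prompt
    output        TEXT,           -- or a pointer into blob storage
    tokens_in     INTEGER,
    tokens_out    INTEGER,
    created_at    TEXT NOT NULL,
    expires_at    TEXT NOT NULL   -- drives retention-policy cleanup
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
```

A periodic job that deletes rows past `expires_at` (and their blobs) is one way to enforce the retention policy.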
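Per-tenant observability starts with consistently labeled metrics (tenant, model, outcome). The stdlib stand-in below illustrates only the labeling scheme; a real deployment would use a metrics client such as a Prometheus or StatsD library:

```python
import collections
import time

counters: collections.Counter = collections.Counter()
latencies: dict[str, list[float]] = collections.defaultdict(list)


def observe_job(tenant_id: str, model: str, started: float, ok: bool) -> None:
    # Labels in the metric name mimic Prometheus-style label sets.
    outcome = "ok" if ok else "error"
    counters[f"jobs_total{{tenant={tenant_id},model={model},outcome={outcome}}}"] += 1
    latencies[f"job_latency_s{{tenant={tenant_id}}}"].append(time.time() - started)
```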
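Among the non-functionals, per-tenant rate limiting is often a token bucket: capacity bounds the burst, the refill rate bounds sustained throughput. The rates here are illustrative:

```python
import time


class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s        # sustained requests per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.stamp = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Usage sketch: one bucket per tenant, e.g. 5 req/s sustained, bursts of 20.
bucket = TokenBucket(rate_per_s=5.0, burst=20)
if not bucket.allow():
    pass  # reject with HTTP 429 and a Retry-After hint
```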
Follow-up
- Support streaming of partial outputs and cancellation of in-flight jobs. (Sketch below.)
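Streaming and cancellation compose naturally in one loop: the worker checks a cancel flag between chunks and publishes partial deltas as they arrive. `publish` and the event shapes are assumptions (e.g., delivered over SSE or WebSockets):

```python
import threading
from typing import Iterator


def stream_job(chunks: Iterator[str], cancel: threading.Event,
               publish) -> str:
    """`chunks` stands in for the provider's token stream; `publish` is a
    hypothetical callback that forwards events to the client."""
    parts: list[str] = []
    for chunk in chunks:
        if cancel.is_set():
            # Flush whatever was produced before the cancel landed.
            publish({"event": "cancelled", "partial": "".join(parts)})
            return "cancelled"
        parts.append(chunk)
        publish({"event": "delta", "text": chunk})
    publish({"event": "done", "text": "".join(parts)})
    return "succeeded"
```

The cancel endpoint from the API sketch would set this flag (e.g., via a shared store keyed by job_id), so cancellation takes effect at the next chunk boundary rather than killing the worker.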
Describe the architecture, data flows, and key design choices. Provide concrete API designs and operational policies.