System Design: End-to-End Image Object-Detection Service
Context
Design a production-grade service that ingests user-uploaded images, runs object detection models, and returns detections via APIs. Assume both synchronous (low-latency) and asynchronous (high-throughput) use cases. If you need concrete numbers to reason about trade-offs, you may assume a moderate scale (e.g., 1–5k RPS peak, average image ~1 MB, typical image sizes 640–2048 px on the long side), but state any assumptions you make.
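As a rough illustration of stating assumptions, the scale figures above can be turned into ingest bandwidth and raw storage estimates. This is only a sketch: the helper names, the decimal-megabyte convention, and the choice to treat peak RPS as sustained for a full day (which gives an upper bound) are assumptions, not part of the prompt.

```python
# Back-of-envelope sizing from the assumed scale (1-5k RPS peak, ~1 MB images).
MB = 1_000_000  # assumption: decimal megabyte


def ingest_bandwidth_gbps(rps: int, avg_image_bytes: int = MB) -> float:
    """Ingest bandwidth in gigabits per second at the given request rate."""
    return rps * avg_image_bytes * 8 / 1e9


def daily_storage_tb(rps: int, avg_image_bytes: int = MB) -> float:
    """Raw image bytes per day in terabytes if `rps` were sustained all day.

    Peak traffic is rarely sustained, so treat this as an upper bound.
    """
    return rps * avg_image_bytes * 86_400 / 1e12


if __name__ == "__main__":
    for rps in (1_000, 5_000):
        print(f"{rps} RPS: {ingest_bandwidth_gbps(rps):.1f} Gbps ingest, "
              f"up to {daily_storage_tb(rps):.1f} TB/day raw images")
```

Numbers like these help decide early whether uploads should bypass the API tier (for instance via pre-signed URLs) and how aggressive the retention policy needs to be.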
Requirements
Specify the following:
- Functional requirements
  - Public APIs to submit images and retrieve detections
  - Synchronous detection for small/latency-sensitive requests
  - Asynchronous detection for large images/bulk loads
  - Idempotency, pagination, authentication/authorization
  - Result formats (bounding boxes, classes, confidences; optional masks)
- Non-functional requirements
  - Accuracy targets (e.g., mAP@0.5, mAP@[0.5:0.95])
  - Latency SLOs (p50/p95 for sync vs. async)
  - Throughput targets (RPS or jobs/sec)
  - Availability (e.g., 99.9%+), durability, cost constraints
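To make the result-format requirement concrete, here is one possible shape for a detection record and its JSON response. Everything in it is a hypothetical choice for illustration: the field names (`label`, `box_xyxy`, `mask_rle`), the pixel-coordinate corner convention, and the `to_response` helper are assumptions, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Detection:
    label: str                                    # predicted class name
    confidence: float                             # score in [0, 1]
    box_xyxy: Tuple[float, float, float, float]   # x_min, y_min, x_max, y_max in pixels
    mask_rle: Optional[str] = None                # optional mask (assumed run-length encoded)

    def __post_init__(self) -> None:
        # Reject malformed records before they reach storage or clients.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        x0, y0, x1, y1 = self.box_xyxy
        if x1 <= x0 or y1 <= y0:
            raise ValueError("box must have positive width and height")


def to_response(image_id: str, model_version: str,
                detections: List[Detection]) -> str:
    """Serialize detections into a JSON response body."""
    body = {
        "image_id": image_id,
        "model_version": model_version,  # lets clients tie results to a model
        "detections": [asdict(d) for d in detections],
    }
    return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    print(to_response("img-1", "v1", [Detection("dog", 0.9, (10, 20, 110, 220))]))
```

Including the model version in every response is one way to keep results traceable across rollouts and rollbacks.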
High-Level Architecture
Describe at a high level:
- Ingestion (upload endpoints, pre-signed URLs)
- Storage (object store for images, DB for metadata/results)
- Preprocessing pipeline (resize/normalize/EXIF/format conversion)
- Model serving tier (GPU inference, batching)
- Asynchronous workers and queues (with DLQs/backpressure)
- APIs for submit/status/results
- Observability (metrics/logs/traces)
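The queue item above (DLQs and backpressure) can be sketched with an in-memory stand-in. This is a single-threaded toy, not a real broker: the `JobQueue` class, its parameters, and the idea of mapping a rejected submit to an HTTP 429 are all assumptions made for illustration.

```python
import queue
from typing import Callable, List


class JobQueue:
    """Bounded job queue with retry counting and a dead-letter list."""

    def __init__(self, max_depth: int, max_attempts: int) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=max_depth)
        self.max_attempts = max_attempts
        self.dead_letters: List[str] = []

    def submit(self, job_id: str) -> bool:
        """Enqueue a job; return False when full so the caller can shed load."""
        try:
            self._q.put_nowait((job_id, 1))
            return True
        except queue.Full:
            return False  # backpressure: e.g. respond with HTTP 429 (assumption)

    def work_one(self, handler: Callable[[str], None]) -> None:
        """Process one job; retry on failure, dead-letter after max_attempts."""
        job_id, attempt = self._q.get_nowait()
        try:
            handler(job_id)
        except Exception:
            if attempt >= self.max_attempts:
                self.dead_letters.append(job_id)
            else:
                # Single-threaded sketch: the slot just freed guarantees room.
                self._q.put_nowait((job_id, attempt + 1))

    def depth(self) -> int:
        return self._q.qsize()
```

A production design would replace this with a durable broker, but the same three decisions remain: when to reject new work, how many times to retry, and where poisoned jobs go for inspection.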
Data/Version Management
- Model registry, dataset versioning, schema evolution
- Reproducible training and rollbacks (model and data)
Performance/Operations
- Batching strategy, GPU utilization, concurrency
- Autoscaling strategy (request- and queue-driven)
- Caching strategies (results, model artifacts)
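One common batching strategy is dynamic micro-batching: flush a batch when it reaches a size limit or when the oldest request in it has waited too long. The function below simulates that policy over a list of arrival times so the trade-off can be reasoned about offline. The name `plan_batches` and its parameters are hypothetical, and arrivals are assumed to be sorted in ascending order.

```python
from typing import List, Optional


def plan_batches(arrival_ms: List[float], max_batch: int,
                 max_wait_ms: float) -> List[List[int]]:
    """Group request indices into batches under size and wait limits.

    A batch is flushed when it holds `max_batch` requests, or when a new
    request arrives after the first request's wait deadline has passed.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    deadline: Optional[float] = None
    for i, t in enumerate(arrival_ms):
        if current and deadline is not None and t > deadline:
            batches.append(current)   # oldest request waited long enough
            current = []
        if not current:
            deadline = t + max_wait_ms  # deadline is set by the first request
        current.append(i)
        if len(current) == max_batch:
            batches.append(current)   # batch is full
            current = []
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(plan_batches([0, 1, 2, 3, 20, 21], max_batch=3, max_wait_ms=5))
```

Raising `max_wait_ms` tends to improve GPU utilization at the cost of added latency for the first request in each batch, which is why sync and async paths often use different limits.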
Modeling & ML Ops
- Model choices: single-stage vs two-stage, and when to use each
- Training and labeling pipeline (active learning, QA)
- Evaluation metrics and validation gates
- Online/offline monitoring (drift, quality, SLIs/SLOs)
- A/B testing and rollout/guardrails
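The accuracy targets and evaluation metrics mentioned above (mAP@0.5, mAP@[0.5:0.95]) are built on intersection over union (IoU) between predicted and ground-truth boxes. A minimal sketch of IoU follows, assuming boxes are given as (x_min, y_min, x_max, y_max) corner tuples; the `is_match` helper and its default threshold are illustrative names only.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Overlap extents are clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = max(0.0, ax1 - ax0) * max(0.0, ay1 - ay0)
    area_b = max(0.0, bx1 - bx0) * max(0.0, by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as a true positive at this IoU threshold."""
    return iou(pred, truth) >= threshold
```

mAP@0.5 uses a single threshold of 0.5, while mAP@[0.5:0.95] averages over thresholds from 0.5 to 0.95 in steps of 0.05, so it rewards tighter localization.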
Reliability & Compliance
- Failure modes, retries, backpressure, timeouts, circuit breakers
- Privacy, compliance, data retention, regionality
- Cost controls (GPU choice, right-sizing, spot/preemptible instances)
- Deployment strategy (blue/green, canary, rollback)
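For the retries item above, one widely used approach is capped exponential backoff with full jitter, which spreads retries out so that a recovering dependency is not hit by synchronized bursts. The sketch below only computes the delay schedule; the function name, default values, and the injectable `rng` parameter are assumptions chosen for illustration.

```python
import random
from typing import List, Optional


def backoff_delays(max_attempts: int, base_s: float = 0.1, cap_s: float = 5.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one randomized delay (seconds) per retry attempt.

    Attempt k draws uniformly from [0, min(cap_s, base_s * 2**k)], so the
    upper bound doubles each attempt until it reaches the cap.
    """
    rng = rng or random.Random()
    return [
        rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
        for attempt in range(max_attempts)
    ]


if __name__ == "__main__":
    # Seeded generator so the printed schedule is repeatable.
    for k, delay in enumerate(backoff_delays(6, rng=random.Random(0))):
        print(f"attempt {k}: sleep {delay:.3f}s")
```

Retries like these should be paired with a total time budget and idempotent request handling, otherwise they can amplify load during an outage instead of absorbing it.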