# System Design: Image Object Detection Service

## Scenario

Design an image object detection service that accepts user-uploaded images and returns detected objects with bounding boxes and confidence scores. The service must support both real-time online inference and high-throughput batch processing.

## Clarify and Quantify Requirements

State assumptions for any missing numbers, and justify them.

**Traffic and performance**

- Online inference: target end-to-end latency (P50/P95/P99), expected and peak QPS, regional distribution, and acceptable tail behavior.
- Batch processing: target throughput (images/sec), acceptable end-to-end SLA (e.g., minutes/hours), and concurrency.
- Payloads: typical and max image size (KB/MB), formats (JPEG/PNG/WebP), and max resolution.
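
As a sanity check on these numbers, a back-of-the-envelope sizing sketch helps anchor the discussion; every figure below (QPS, batch size, per-batch latency, utilization target) is an assumed placeholder, not a requirement:

```python
# Rough GPU sizing for the online path. All numbers are assumptions.
PEAK_QPS = 500            # assumed peak online request rate
BATCH_SIZE = 8            # assumed serving batch size
BATCH_LATENCY_S = 0.040   # assumed GPU forward pass per batch (40 ms)
TARGET_UTIL = 0.6         # keep headroom for spikes and tail latency

images_per_gpu_per_s = BATCH_SIZE / BATCH_LATENCY_S        # 200 images/s
gpus_at_peak = PEAK_QPS / (images_per_gpu_per_s * TARGET_UTIL)
print(f"~{gpus_at_peak:.1f} GPUs at peak")                 # ~4.2
```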

**Quality**

- Accuracy metrics: required mAP@0.5 and mAP@[0.5:0.95], precision/recall, calibration targets.
- Classes: expected number of object classes and class imbalance considerations.

**Constraints**

- Cost budget, multi-tenancy/isolation needs, regions, and compliance requirements.

## Deliverables

Propose and justify an end-to-end architecture that includes:

**Ingestion**

- API gateway, auth (e.g., OAuth2/JWT), WAF, rate limiting/quotas, request validation.
- Upload flow: direct-to-object-store with pre-signed URLs vs. proxying uploads through the API.
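
A minimal sketch of the pre-signed-URL flow, assuming an S3-compatible store via boto3; the `uploads` bucket name, key layout, and five-minute expiry are placeholders:

```python
import uuid

import boto3

s3 = boto3.client("s3")

def create_upload_url(tenant_id: str, content_type: str = "image/jpeg") -> dict:
    """Issue a short-lived pre-signed PUT URL so the client uploads
    directly to object storage instead of streaming through the API tier."""
    key = f"raw/{tenant_id}/{uuid.uuid4()}"  # hypothetical key layout
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "uploads", "Key": key, "ContentType": content_type},
        ExpiresIn=300,  # keep the window short to limit replay
    )
    return {"upload_url": url, "object_key": key}
```

The detection request then references `object_key`, which keeps multi-megabyte payloads off the request path and lets the API tier stay small.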

**Storage and data model**

- Object storage for raw/derived images; metadata database schema for requests, results, and model versions.
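
One plausible shape for the metadata records, sketched as dataclasses; every field name here is illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

def _now() -> datetime:
    return datetime.now(timezone.utc)

@dataclass
class DetectionRequest:
    request_id: str          # primary key; doubles as the idempotency key
    tenant_id: str
    object_key: str          # raw image location in object storage
    model_version: str       # pinned at request time for reproducibility
    status: str = "pending"  # pending | running | done | failed
    created_at: datetime = field(default_factory=_now)

@dataclass
class Detection:
    request_id: str          # foreign key to DetectionRequest
    label: str
    confidence: float        # in [0, 1]
    bbox: tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)

@dataclass
class ModelVersion:
    version: str             # e.g., a training run id
    artifact_uri: str        # packaged model artifact in object storage
    eval_map: float          # offline mAP@[0.5:0.95] recorded at release
```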

**Preprocessing and postprocessing**

- Image validation, resizing/normalization, EXIF handling, virus scanning, and output formatting.
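
A preprocessing sketch using Pillow; the 10 MB cap and 640x640 target are assumptions, and virus scanning is left to a separate service:

```python
from io import BytesIO

from PIL import Image, ImageOps

MAX_BYTES = 10 * 1024 * 1024   # assumed upload cap
ALLOWED = {"JPEG", "PNG", "WEBP"}
TARGET = (640, 640)            # assumed model input resolution

def preprocess(data: bytes) -> Image.Image:
    if len(data) > MAX_BYTES:
        raise ValueError("image too large")
    probe = Image.open(BytesIO(data))
    probe.verify()                    # cheap structural validation
    img = Image.open(BytesIO(data))   # reopen: verify() consumes the handle
    if img.format not in ALLOWED:
        raise ValueError(f"unsupported format: {img.format}")
    img = ImageOps.exif_transpose(img)  # bake in EXIF orientation
    img = img.convert("RGB")            # drop alpha, normalize mode
    # Letterbox to the model input size, preserving aspect ratio.
    return ImageOps.pad(img, TARGET, color=(114, 114, 114))
```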

**Model serving**

- GPU-backed serving, autoscaling policy, batching strategy, model formats (e.g., ONNX/TensorRT), and multi-model/version hosting.
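
The batching strategy is the core of GPU efficiency. Below is a dynamic-batching sketch in plain asyncio: requests accumulate until the batch is full or a small deadline expires, whichever comes first. `run_model` stands in for the actual ONNX Runtime/TensorRT session, and both tuning constants are assumptions:

```python
import asyncio

MAX_BATCH = 16        # assumed; tune against GPU memory and the latency SLO
MAX_WAIT_S = 0.005    # assumed 5 ms batching window

queue: asyncio.Queue = asyncio.Queue()

async def infer(image):
    """Called once per request; resolves to that image's detections."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((image, fut))
    return await fut

async def batcher(run_model):
    """Single consumer that drains the queue into GPU-sized batches."""
    while True:
        image, fut = await queue.get()
        batch, futs = [image], [fut]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                image, fut = await asyncio.wait_for(queue.get(), timeout)
            except asyncio.TimeoutError:
                break
            batch.append(image)
            futs.append(fut)
        for f, result in zip(futs, run_model(batch)):  # one forward pass
            f.set_result(result)
```

Bigger batches buy throughput at the cost of queueing delay, so the window should be sized against the P95 target; serving frameworks with built-in dynamic batching (e.g., Triton) implement the same idea.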

**Caching**

- Result caching strategy (keys, TTL), CDN considerations, and cache invalidation on model updates.
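
A sketch of the cache-key scheme: keying on a digest of the image bytes plus the model version means a rollout changes every key, so stale results age out without an explicit invalidation pass. The `det:` prefix and version string are illustrative:

```python
import hashlib

def cache_key(image_bytes: bytes, model_version: str) -> str:
    digest = hashlib.sha256(image_bytes).hexdigest()
    return f"det:{model_version}:{digest}"
```

TTLs should be bounded by the data-retention policy; if users can delete images, cached results derived from them need the same treatment.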

**Asynchronous workflows**

- Queues/streams, idempotency, retries/DLQs for batch and overflow traffic.
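
A worker-loop sketch of the idempotency and DLQ pattern. The `store`, `process`, `dlq`, and message objects are hypothetical interfaces, and `MAX_ATTEMPTS` is an assumption; the shape of the logic is the point:

```python
MAX_ATTEMPTS = 3  # assumed retry budget before parking the message

def handle(msg, store, process, dlq) -> None:
    """Process one queue message under at-least-once delivery."""
    if store.exists(msg.request_id):       # duplicate delivery: drop it
        return
    try:
        result = process(msg)              # run detection on the image
        store.put_if_absent(msg.request_id, result)
    except Exception:
        if msg.attempt + 1 >= MAX_ATTEMPTS:
            dlq.send(msg)                  # park for manual inspection
        else:
            msg.retry(backoff_s=2 ** msg.attempt)  # exponential backoff
```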

**Model lifecycle**

- Versioning, A/B testing or shadow traffic, rollout/rollback.
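
A routing sketch for canary rollout: a stable hash of the request id assigns each request to a version, so a given request always routes the same way, and the canary share can be dialed up or rolled back by changing one number. Version ids and the 5% fraction are placeholders:

```python
import hashlib

STABLE = "v1"             # placeholder version ids
CANARY = "v2-candidate"
CANARY_FRACTION = 0.05    # 5% of traffic; raise gradually, set 0 to roll back

def pick_version(request_id: str) -> str:
    # Stable hash so routing is deterministic per request id.
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 10_000
    return CANARY if bucket < CANARY_FRACTION * 10_000 else STABLE
```

Shadow traffic is the same router with the candidate invoked asynchronously on a copy of the request; its output is logged for offline comparison instead of being returned.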

**Monitoring and reliability**

- Metrics (latency percentiles, throughput, GPU utilization/memory, queue depth), drift detection, alerting SLOs, failure modes and fallbacks.
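
Concept drift often shows up first as a shift in the confidence-score distribution. Below is a sketch using the population stability index (PSI) against a reference window; the bin count and 0.2 alert threshold are conventional rules of thumb, not requirements:

```python
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Population stability index between two score distributions."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # cover the full range
    ref = np.histogram(reference, edges)[0] / len(reference)
    cur = np.histogram(live, edges)[0] / len(live)
    ref = np.clip(ref, 1e-6, None)             # avoid log(0)
    cur = np.clip(cur, 1e-6, None)
    return float(np.sum((cur - ref) * np.log(cur / ref)))

# Assumed rule of thumb: PSI > 0.2 on confidences is worth an alert.
```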

**Offline training pipeline**

- Data labeling, augmentation, experiment tracking, evaluation, and packaging for serving.
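
For packaging, one common route is exporting the trained detector to ONNX with a dynamic batch axis so the serving tier can batch freely. A sketch assuming a PyTorch model with NCHW 640x640 input; the opset and tensor names are placeholders:

```python
import torch

def export_onnx(model: torch.nn.Module, path: str) -> None:
    model.eval()
    dummy = torch.randn(1, 3, 640, 640)  # assumed input layout
    torch.onnx.export(
        model,
        dummy,
        path,
        input_names=["images"],
        output_names=["detections"],
        dynamic_axes={"images": {0: "batch"}, "detections": {0: "batch"}},
        opset_version=17,  # assumed; match the runtime's supported opset
    )
```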

**Deployment and rollout**

- Blue/green or canary strategy; CI/CD and validation gates.
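
A sketch of one validation gate in the CI pipeline: promote only if the candidate matches the baseline's accuracy and stays within a latency budget. Metric names and thresholds are assumptions:

```python
def promotion_gate(candidate: dict, baseline: dict) -> bool:
    """Inputs are eval summaries, e.g. {"map": 0.41, "p95_ms": 38.0}."""
    max_latency_regression = 1.10  # assumed: tolerate up to 10% slower P95
    return (
        candidate["map"] >= baseline["map"]
        and candidate["p95_ms"] <= baseline["p95_ms"] * max_latency_regression
    )
```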

**Cost/performance trade-offs**

- GPU types, batching, quantization, spot vs. on-demand, and multi-tenancy isolation.
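
These trade-offs reduce to cost per million images, which makes the options comparable. A sketch where every price and throughput figure is a placeholder for measured numbers:

```python
def cost_per_million(price_per_hour: float, images_per_s: float) -> float:
    return price_per_hour / (images_per_s * 3600) * 1_000_000

# Illustrative placeholders: large GPU, small GPU, small GPU on spot.
print(cost_per_million(4.00, 900))  # ~$1.23 per 1M images
print(cost_per_million(1.20, 250))  # ~$1.33 per 1M images
print(cost_per_million(0.40, 250))  # ~$0.44 if interruptions are tolerable
```

Quantization (FP16/INT8) moves the throughput term directly, which is why it sits in the same trade-off as instance choice.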

**Privacy, security, compliance**

- Data retention/deletion, encryption, access control, audit logging, regionalization, and acceptable use/content controls.
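
Retention is usually enforced with the object store's native lifecycle rules; as a fallback, a sweep job can enforce the same policy explicitly. A sketch assuming an S3-compatible store, with the bucket, prefix, and 30-day window as placeholders:

```python
from datetime import datetime, timedelta, timezone

import boto3

RETENTION = timedelta(days=30)  # assumed policy window

def sweep(bucket: str = "uploads", prefix: str = "raw/") -> None:
    s3 = boto3.client("s3")
    cutoff = datetime.now(timezone.utc) - RETENTION
    for page in s3.get_paginator("list_objects_v2").paginate(
        Bucket=bucket, Prefix=prefix
    ):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                s3.delete_object(Bucket=bucket, Key=obj["Key"])
                # Every deletion should also emit an audit-log record.
```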

Provide diagrams where useful (ASCII is fine), capacity-planning math, and a brief risk/mitigation section.