Scenario
Design a service that lets users upload an image containing text (e.g., a menu, screenshot, street sign), and returns the same image with the text translated into a target language (or multiple target languages).
Requirements
Functional
- Upload an image and select target language(s).
- System performs OCR to detect text and bounding boxes.
- Translate detected text into target language(s).
- Return:
  - A translated image (text rendered back onto the image), and/or
  - A structured result: detected text blocks, bounding boxes, original text, translated text.
- Support common image formats (JPG/PNG). Handle large images.
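
One way to picture the structured result is as a small set of typed records. The sketch below is only an illustration: the class names (`BoundingBox`, `TextBlock`, `TranslationResult`), the field names, and the pixel-based box convention are assumptions, not part of the prompt.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class BoundingBox:
    # Assumed convention: top-left corner plus size, in pixels.
    x: int
    y: int
    width: int
    height: int


@dataclass
class TextBlock:
    box: BoundingBox
    original_text: str
    # Maps a target language code (e.g. "fr") to the translated text.
    translations: dict[str, str] = field(default_factory=dict)


@dataclass
class TranslationResult:
    image_id: str          # hypothetical identifier for the uploaded image
    source_language: str
    blocks: list[TextBlock] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses, lists, and dicts,
        # so the result can be passed straight to json.dumps().
        return asdict(self)


result = TranslationResult(
    image_id="img-1",
    source_language="en",
    blocks=[TextBlock(BoundingBox(10, 20, 100, 30), "Soup", {"fr": "Soupe"})],
)
payload = result.to_dict()
```

Keeping translations as a per-block map keyed by language lets a single response carry several target languages without repeating the OCR output.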
Non-functional
- Low latency for interactive usage (single image).
- Scalable for bursts (batch-like spikes).
- Reliability: retries, idempotency, job tracking.
- Security & privacy: images may contain sensitive content.
- Observability: metrics, logs, tracing.
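
To make the idempotency and job-tracking requirement concrete, here is a minimal sketch in which a retried upload maps to the same job instead of creating a duplicate. Everything in it is an assumption for illustration: the key derivation (user, image hash, normalized language set), the in-memory `JobStore`, and the status values. A real service would back this with a durable store.

```python
import hashlib


def idempotency_key(user_id: str, image_bytes: bytes, target_languages: list[str]) -> str:
    """Derive a stable key so the same request always maps to the same job."""
    digest = hashlib.sha256()
    digest.update(user_id.encode("utf-8"))
    digest.update(b"\x00")
    digest.update(hashlib.sha256(image_bytes).digest())
    # Normalize languages: case-insensitive, de-duplicated, order-independent.
    for lang in sorted({lang.lower() for lang in target_languages}):
        digest.update(b"\x00")
        digest.update(lang.encode("utf-8"))
    return digest.hexdigest()


class JobStore:
    """In-memory stand-in for a jobs table (hypothetical)."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict] = {}

    def submit(self, user_id: str, image_bytes: bytes, target_languages: list[str]) -> dict:
        key = idempotency_key(user_id, image_bytes, target_languages)
        job = self._jobs.get(key)
        if job is None:
            # First time this request is seen: create a new job record.
            job = {"job_id": key[:16], "status": "PENDING", "attempts": 0}
            self._jobs[key] = job
        # A retry of the same request returns the existing record.
        return job


store = JobStore()
first = store.submit("user-1", b"image-bytes", ["fr", "de"])
retry = store.submit("user-1", b"image-bytes", ["DE", "fr"])
other = store.submit("user-1", b"other-image", ["fr", "de"])
```

Here `first` and `retry` refer to the same job, while `other` gets a new one because the image content differs.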
Deep-dive areas to be ready for
- Frontend ↔ backend communication and security (authn/authz, request integrity).
- Data model: what tables/documents exist, what each is used for.
- How you store images and derived artifacts (OCR outputs, translations, rendered images).
- Handling multiple target languages per image and caching.
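
For the last point, one possible approach is to cache OCR output per image and translations per (image, language) pair, so that adding a new target language reuses the OCR work. The sketch below assumes content-hash keys and takes the OCR and translation steps as injected callables; `TranslationCache`, `run_ocr`, and `translate` are hypothetical names, and the dictionaries stand in for a real cache or database.

```python
import hashlib
from typing import Callable


class TranslationCache:
    def __init__(
        self,
        run_ocr: Callable[[bytes], list[str]],
        translate: Callable[[str, str], str],
    ) -> None:
        self._run_ocr = run_ocr
        self._translate = translate
        self._ocr: dict[str, list[str]] = {}                       # image hash -> detected texts
        self._translations: dict[tuple[str, str], list[str]] = {}  # (image hash, lang) -> texts

    def get(self, image_bytes: bytes, target_languages: list[str]) -> dict[str, list[str]]:
        image_hash = hashlib.sha256(image_bytes).hexdigest()
        # OCR runs at most once per distinct image.
        if image_hash not in self._ocr:
            self._ocr[image_hash] = self._run_ocr(image_bytes)
        texts = self._ocr[image_hash]

        results: dict[str, list[str]] = {}
        for lang in target_languages:
            key = (image_hash, lang)
            # Translate only for languages not already cached for this image.
            if key not in self._translations:
                self._translations[key] = [self._translate(t, lang) for t in texts]
            results[lang] = self._translations[key]
        return results
```

With this split, a second request for the same image that adds one language triggers no OCR call and translates only for the new language.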