System Design: Productionizing Duplicate-Photo Removal at Scale
Context
Design a production system that detects and removes duplicate or near-duplicate photos across tens of millions of files stored in multiple storage locations (e.g., object stores, NAS, user devices) and processed by multiple machines. Assume batch and incremental runs, multi-tenant data, and strict safety/observability requirements.
Requirements
Cover the following decisions and trade-offs (illustrative sketches for each area follow the list):
- Dedupe index
  - Choose a distributed KV/DB (e.g., Redis vs. alternatives) for the dedupe index.
  - Key schema, memory sizing, and eviction strategy.
- Batch pipeline
  - Use MapReduce/Spark (or similar) to scan and hash at scale.
  - Partitioning strategy and data locality across storage locations/regions.
- Correctness under concurrency
  - Idempotency, exactly-once processing, and handling race conditions when workers overlap.
- Safety and rollout
  - Verification before deletion, canarying, a soft-delete/restore workflow, and immutable audit logs.
- Reliability
  - Fault tolerance, retries, backpressure, and monitoring (latency, throughput, error rates).
- Incrementality
  - Incremental runs for newly added files and periodic re-hashing when algorithms change.
- Security and isolation
  - Permissions model and multi-tenant isolation.
- Cost and throughput
  - Provide rough cost and throughput estimates and capacity sizing assumptions.
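Illustrative sketches
The sketches below are minimal examples under stated assumptions, not a reference implementation; all are in Python.

Dedupe index. One possible key schema, assuming Redis (via the redis-py client) and SHA-256 content hashes; the `dedupe:{tenant}:{sha256}` layout and the host settings are illustrative assumptions.

```python
import hashlib

import redis  # assumes the redis-py client

r = redis.Redis(host="localhost", port=6379)  # illustrative; production would use Redis Cluster

def content_hash(data: bytes) -> str:
    """SHA-256 of the raw bytes; roughly 100 bytes per index entry including key overhead."""
    return hashlib.sha256(data).hexdigest()

def register(tenant: str, path: str, data: bytes) -> str | None:
    """Record the first file seen for a hash; return the canonical path if this one is a duplicate.

    Key layout (assumed): dedupe:{tenant}:{sha256} -> canonical path.
    SET ... NX lets the first writer win, so concurrent workers agree on one canonical copy.
    """
    key = f"dedupe:{tenant}:{content_hash(data)}"
    if r.set(key, path, nx=True):
        return None  # first copy seen; nothing to dedupe
    canonical = r.get(key)
    return canonical.decode() if canonical else None
```

At tens of millions of entries the index is only a few GB (about 100 bytes per entry), so it fits in memory; eviction should be disabled (Redis `noeviction`), because losing entries silently re-admits duplicates.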
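Batch pipeline. A sketch of the scan-and-hash stage, assuming PySpark and its binaryFile data source; the bucket, prefix, and glob are assumptions. Partitioning follows the input layout: running one load per storage location/region keeps reads local.

```python
import hashlib

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, collect_list, size, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("photo-dedupe-scan").getOrCreate()

sha256 = udf(lambda content: hashlib.sha256(content).hexdigest(), StringType())

# binaryFile exposes (path, modificationTime, length, content) per file.
files = (spark.read.format("binaryFile")
         .option("pathGlobFilter", "*.jpg")      # illustrative filter
         .load("s3a://photos-tenant-a/"))        # illustrative bucket; one load per region/store

hashes = files.select("path", "length", sha256(col("content")).alias("sha256"))

# Any hash that maps to more than one path is a duplicate set.
duplicate_sets = (hashes.groupBy("sha256")
                  .agg(collect_list("path").alias("paths"))
                  .filter(size(col("paths")) > 1))

duplicate_sets.write.mode("overwrite").parquet("s3a://dedupe-reports/run-001/")  # illustrative output
```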
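Correctness under concurrency. Exactly-once deletion is achieved in practice as idempotent at-least-once work gated by a claim that only one worker can win. A sketch assuming Redis for the claim/done markers; `soft_delete` is a hypothetical helper (e.g., a move into a quarantine prefix).

```python
import uuid

import redis

r = redis.Redis()

def soft_delete(path: str) -> None:
    """Hypothetical helper: move the object to a restorable quarantine prefix."""
    ...

def process_duplicate_set(sha256: str, duplicate_paths: list[str]) -> bool:
    """Idempotent handling of one duplicate set.

    Keys (assumed): dedupe:claim:{sha256} is a lease so only one worker acts at a time;
    dedupe:done:{sha256} makes retries and re-runs no-ops. Soft-deleting an already
    quarantined object is itself harmless, so a crash between steps is safe to replay.
    """
    if r.exists(f"dedupe:done:{sha256}"):
        return False  # already processed by an earlier run or another worker
    token = str(uuid.uuid4())
    # NX + EX is a lease: one winner, and a crashed worker's claim expires on its own.
    if not r.set(f"dedupe:claim:{sha256}", token, nx=True, ex=3600):
        return False
    for path in duplicate_paths:
        soft_delete(path)
    r.set(f"dedupe:done:{sha256}", "1")
    return True
```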
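Safety and rollout. Before anything is quarantined, both objects are re-read and re-hashed, so a stale index entry or a file modified after the scan never triggers a wrong deletion, and every action is appended to an audit log. `storage` and `audit_log` are assumed interfaces (read/move and append), and the quarantine prefix is illustrative; rollout would start in report-only mode and canary on a small tenant before quarantine is enabled broadly.

```python
import hashlib
import json
import time

def verify_then_soft_delete(storage, audit_log, canonical: str, candidate: str) -> bool:
    """Re-hash both objects immediately before acting, then soft-delete and audit.

    storage.read(path) -> bytes, storage.move(src, dst), and audit_log.append(line)
    are assumed interfaces; the quarantine/ prefix keeps the object restorable.
    """
    same = (hashlib.sha256(storage.read(canonical)).hexdigest()
            == hashlib.sha256(storage.read(candidate)).hexdigest())
    if not same:
        return False  # contents diverged since the scan; skip and re-queue for re-hashing
    storage.move(candidate, f"quarantine/{candidate}")  # soft delete, restorable for N days
    audit_log.append(json.dumps({
        "action": "soft_delete",
        "candidate": candidate,
        "kept": canonical,
        "ts": time.time(),
    }))
    return True
```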
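Reliability. Calls to the object store and the index will fail transiently; a sketch of retry with exponential backoff and full jitter, which also acts as crude backpressure when paired with a bounded work queue. The attempt count and delays are assumptions.

```python
import random
import time

def with_retries(op, attempts: int = 5, base_delay_s: float = 0.5):
    """Retry a flaky storage/index call with exponential backoff and full jitter,
    so transient errors are absorbed and synchronized retry storms are avoided."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # surface the error to the job scheduler / dead-letter path
            time.sleep(random.uniform(0, base_delay_s * (2 ** attempt)))
```

Usage is wrapping any single remote call, e.g. `with_retries(lambda: r.get(key))`; retry counts and error rates feed the latency/throughput/error dashboards.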
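Incrementality. Incremental runs only re-hash files modified since a persisted watermark; an algorithm change is a watermark reset plus a versioned key prefix so old and new hashes never mix. A sketch on top of the same binaryFile source; the checkpointed `last_run` timestamp is an assumption.

```python
from datetime import datetime

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col

def incremental_scan(spark: SparkSession, root: str, last_run: datetime) -> DataFrame:
    """Return only files modified since the previous run's watermark.

    A full re-hash (e.g., after switching hash algorithms) is just last_run = epoch,
    combined with a new index key version such as dedupe:v2:{tenant}:{sha256}.
    """
    return (spark.read.format("binaryFile")
            .load(root)
            .filter(col("modificationTime") > last_run))

# Usage (illustrative): watermark read from a checkpoint store after the previous run.
# new_files = incremental_scan(spark, "s3a://photos-tenant-a/", datetime(2024, 1, 1))
```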
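Security and isolation. Beyond storage ACLs and per-tenant credentials, workers can enforce tenant scope defensively; the `{tenant}/...` path layout is an assumption.

```python
def assert_tenant_scope(job_tenant: str, path: str) -> None:
    """Defense in depth for multi-tenant isolation: even if a queue or index entry is
    corrupted, a worker refuses to touch objects outside its own tenant's prefix.
    Path layout (assumed): {tenant}/{album}/{file}.
    """
    if not path.startswith(f"{job_tenant}/"):
        raise PermissionError(f"path {path!r} is outside tenant {job_tenant!r}")
```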
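Cost and throughput. A back-of-envelope sizing calculation; every constant is an assumption to be replaced with measured numbers, and in practice object-store read bandwidth (and its request/egress cost) limits throughput long before hashing CPU does.

```python
# Back-of-envelope capacity sizing; every constant below is an assumption.
files = 50_000_000               # photos in scope
avg_size_mb = 3                  # average photo size
hash_mb_per_s_per_core = 200     # rough SHA-256 throughput per core
cores = 256                      # total cores in the batch cluster

total_tb = files * avg_size_mb / 1e6
scan_hours = (files * avg_size_mb) / (hash_mb_per_s_per_core * cores) / 3600
index_gb = files * 100 / 1e9     # ~100 bytes per index entry (key + value + overhead)

print(f"data scanned: {total_tb:.0f} TB")           # ~150 TB
print(f"CPU-bound full scan: ~{scan_hours:.1f} h")  # under an hour; storage I/O will dominate
print(f"dedupe index size: ~{index_gb:.0f} GB")     # ~5 GB, fits on a single Redis node
```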