System Design: Productionizing Duplicate-Photo Removal at Scale
Context
Design a production system that detects and removes duplicate or near-duplicate photos across tens of millions of files stored in multiple storage locations (e.g., object stores, NAS, user devices) and processed by multiple machines. Assume batch and incremental runs, multi-tenant data, and strict safety/observability requirements.
Requirements
Cover the following decisions and trade-offs (illustrative sketches for each area follow the list):
- Dedupe index
  - Choose a distributed KV/DB (e.g., Redis vs. alternatives) for the dedupe index.
  - Key schema, memory sizing, and eviction strategy.
- Batch pipeline
  - Use MapReduce/Spark (or similar) to scan and hash at scale.
  - Partitioning strategy and data locality across storage locations/regions.
- Correctness under concurrency
  - Idempotency, exactly-once processing, and handling race conditions when workers overlap.
- Safety and rollout
  - Verification before deletion, canarying, a soft-delete/restore workflow, and immutable audit logs.
- Reliability
  - Fault tolerance, retries, backpressure, and monitoring (latency, throughput, error rates).
- Incrementality
  - Incremental runs for newly added files and periodic re-hashing when algorithms change.
- Security and isolation
  - Permissions model and multi-tenant isolation.
- Cost and throughput
  - Provide rough cost and throughput estimates and capacity sizing assumptions.
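Illustrative sketches
The sketches below are minimal examples under stated assumptions, not a reference implementation; all are in Python.

Dedupe index. One possible key schema, assuming Redis (via the redis-py client) and SHA-256 content hashes; the `dedupe:{tenant}:{sha256}` layout and the host settings are illustrative assumptions.

```python
import hashlib

import redis  # assumes the redis-py client

r = redis.Redis(host="localhost", port=6379)  # illustrative; production would use Redis Cluster

def content_hash(data: bytes) -> str:
    """SHA-256 of the raw bytes; roughly 100 bytes per index entry including key overhead."""
    return hashlib.sha256(data).hexdigest()

def register(tenant: str, path: str, data: bytes) -> str | None:
    """Record the first file seen for a hash; return the canonical path if this one is a duplicate.

    Key layout (assumed): dedupe:{tenant}:{sha256} -> canonical path.
    SET ... NX lets the first writer win, so concurrent workers agree on one canonical copy.
    """
    key = f"dedupe:{tenant}:{content_hash(data)}"
    if r.set(key, path, nx=True):
        return None  # first copy seen; nothing to dedupe
    canonical = r.get(key)
    return canonical.decode() if canonical else None
```

At tens of millions of entries the index is only a few GB (about 100 bytes per entry), so it fits in memory; eviction should be disabled (Redis `noeviction`), because losing entries silently re-admits duplicates.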
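Batch pipeline. A sketch of the scan-and-hash stage, assuming PySpark and its binaryFile data source; the bucket, prefix, and glob are assumptions. Partitioning follows the input layout: running one load per storage location/region keeps reads local.

```python
import hashlib

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, collect_list, size, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("photo-dedupe-scan").getOrCreate()

sha256 = udf(lambda content: hashlib.sha256(content).hexdigest(), StringType())

# binaryFile exposes (path, modificationTime, length, content) per file.
files = (spark.read.format("binaryFile")
         .option("pathGlobFilter", "*.jpg")      # illustrative filter
         .load("s3a://photos-tenant-a/"))        # illustrative bucket; one load per region/store

hashes = files.select("path", "length", sha256(col("content")).alias("sha256"))

# Any hash that maps to more than one path is a duplicate set.
duplicate_sets = (hashes.groupBy("sha256")
                  .agg(collect_list("path").alias("paths"))
                  .filter(size(col("paths")) > 1))

duplicate_sets.write.mode("overwrite").parquet("s3a://dedupe-reports/run-001/")  # illustrative output
```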
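Correctness under concurrency. Exactly-once deletion is achieved in practice as idempotent at-least-once work gated by a claim that only one worker can win. A sketch assuming Redis for the claim/done markers; `soft_delete` is a hypothetical helper (e.g., a move into a quarantine prefix).

```python
import uuid

import redis

r = redis.Redis()

def soft_delete(path: str) -> None:
    """Hypothetical helper: move the object to a restorable quarantine prefix."""
    ...

def process_duplicate_set(sha256: str, duplicate_paths: list[str]) -> bool:
    """Idempotent handling of one duplicate set.

    Keys (assumed): dedupe:claim:{sha256} is a lease so only one worker acts at a time;
    dedupe:done:{sha256} makes retries and re-runs no-ops. Soft-deleting an already
    quarantined object is itself harmless, so a crash between steps is safe to replay.
    """
    if r.exists(f"dedupe:done:{sha256}"):
        return False  # already processed by an earlier run or another worker
    token = str(uuid.uuid4())
    # NX + EX is a lease: one winner, and a crashed worker's claim expires on its own.
    if not r.set(f"dedupe:claim:{sha256}", token, nx=True, ex=3600):
        return False
    for path in duplicate_paths:
        soft_delete(path)
    r.set(f"dedupe:done:{sha256}", "1")
    return True
```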
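Safety and rollout. Before anything is quarantined, both objects are re-read and re-hashed, so a stale index entry or a file modified after the scan never triggers a wrong deletion, and every action is appended to an audit log. `storage` and `audit_log` are assumed interfaces (read/move and append), and the quarantine prefix is illustrative; rollout would start in report-only mode and canary on a small tenant before quarantine is enabled broadly.

```python
import hashlib
import json
import time

def verify_then_soft_delete(storage, audit_log, canonical: str, candidate: str) -> bool:
    """Re-hash both objects immediately before acting, then soft-delete and audit.

    storage.read(path) -> bytes, storage.move(src, dst), and audit_log.append(line)
    are assumed interfaces; the quarantine/ prefix keeps the object restorable.
    """
    same = (hashlib.sha256(storage.read(canonical)).hexdigest()
            == hashlib.sha256(storage.read(candidate)).hexdigest())
    if not same:
        return False  # contents diverged since the scan; skip and re-queue for re-hashing
    storage.move(candidate, f"quarantine/{candidate}")  # soft delete, restorable for N days
    audit_log.append(json.dumps({
        "action": "soft_delete",
        "candidate": candidate,
        "kept": canonical,
        "ts": time.time(),
    }))
    return True
```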
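Reliability. Calls to the object store and the index will fail transiently; a sketch of retry with exponential backoff and full jitter, which also acts as crude backpressure when paired with a bounded work queue. The attempt count and delays are assumptions.

```python
import random
import time

def with_retries(op, attempts: int = 5, base_delay_s: float = 0.5):
    """Retry a flaky storage/index call with exponential backoff and full jitter,
    so transient errors are absorbed and synchronized retry storms are avoided."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # surface the error to the job scheduler / dead-letter path
            time.sleep(random.uniform(0, base_delay_s * (2 ** attempt)))
```

Usage is wrapping any single remote call, e.g. `with_retries(lambda: r.get(key))`; retry counts and error rates feed the latency/throughput/error dashboards.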
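Incrementality. Incremental runs only re-hash files modified since a persisted watermark; an algorithm change is a watermark reset plus a versioned key prefix so old and new hashes never mix. A sketch on top of the same binaryFile source; the checkpointed `last_run` timestamp is an assumption.

```python
from datetime import datetime

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col

def incremental_scan(spark: SparkSession, root: str, last_run: datetime) -> DataFrame:
    """Return only files modified since the previous run's watermark.

    A full re-hash (e.g., after switching hash algorithms) is just last_run = epoch,
    combined with a new index key version such as dedupe:v2:{tenant}:{sha256}.
    """
    return (spark.read.format("binaryFile")
            .load(root)
            .filter(col("modificationTime") > last_run))

# Usage (illustrative): watermark read from a checkpoint store after the previous run.
# new_files = incremental_scan(spark, "s3a://photos-tenant-a/", datetime(2024, 1, 1))
```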
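Security and isolation. Beyond storage ACLs and per-tenant credentials, workers can enforce tenant scope defensively; the `{tenant}/...` path layout is an assumption.

```python
def assert_tenant_scope(job_tenant: str, path: str) -> None:
    """Defense in depth for multi-tenant isolation: even if a queue or index entry is
    corrupted, a worker refuses to touch objects outside its own tenant's prefix.
    Path layout (assumed): {tenant}/{album}/{file}.
    """
    if not path.startswith(f"{job_tenant}/"):
        raise PermissionError(f"path {path!r} is outside tenant {job_tenant!r}")
```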
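Cost and throughput. A back-of-envelope sizing calculation; every constant is an assumption to be replaced with measured numbers, and in practice object-store read bandwidth (and its request/egress cost) limits throughput long before hashing CPU does.

```python
# Back-of-envelope capacity sizing; every constant below is an assumption.
files = 50_000_000               # photos in scope
avg_size_mb = 3                  # average photo size
hash_mb_per_s_per_core = 200     # rough SHA-256 throughput per core
cores = 256                      # total cores in the batch cluster

total_tb = files * avg_size_mb / 1e6
scan_hours = (files * avg_size_mb) / (hash_mb_per_s_per_core * cores) / 3600
index_gb = files * 100 / 1e9     # ~100 bytes per index entry (key + value + overhead)

print(f"data scanned: {total_tb:.0f} TB")           # ~150 TB
print(f"CPU-bound full scan: ~{scan_hours:.1f} h")  # under an hour; storage I/O will dominate
print(f"dedupe index size: ~{index_gb:.0f} GB")     # ~5 GB, fits on a single Redis node
```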