Design a scalable photo deduplication service
Company: Abnormal Security
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
How would you productionize the duplicate-photo removal algorithm for tens of millions of files across multiple machines and storage locations? Cover:
- choosing a distributed KV store (e.g., Redis) or alternatives for the dedupe index, including key schema, memory sizing, and eviction;
- a batch pipeline (e.g., MapReduce/Spark) for scanning and hashing at scale, with partitioning strategy and data locality;
- idempotency, exactly-once processing, and race conditions when workers overlap;
- verification before deletion, canarying, soft-delete/restore, and audit logs;
- fault tolerance, retries, backpressure, and monitoring (latency, throughput, error rates);
- incremental runs for newly added files and periodic re-hashing;
- security/permissions and multi-tenant isolation;
- cost and throughput estimates.
Quick Answer: This question evaluates competence in large-scale system design: distributed deduplication indexing, batch and incremental processing, concurrency correctness, safety and rollback mechanisms, reliability and observability, multi-tenant security, and cost-throughput capacity planning. A strong answer keeps a content-hash index in a sharded KV store, hashes files with a distributed batch pipeline, claims work idempotently so overlapping workers stay safe, verifies bytes and soft-deletes with an audit trail before reclaiming space, and runs incrementally from a watermark. The hedged sketches below illustrate these pieces.
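One way to anchor the dedupe-index discussion is a concrete key schema. The sketch below assumes Redis via redis-py; the `dedupe:{tenant}:{sha256}` layout, the tenant prefix, and the connection details are illustrative choices, not a prescribed schema. The atomic SET NX doubles as the idempotent claim the question asks about: when two workers race on the same content, exactly one wins.

```python
# Hedged sketch of a dedupe index in Redis. Key names, the tenant scheme,
# and connection details are illustrative assumptions.
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def file_sha256(path: str) -> str:
    """Stream the file so memory stays flat even for large photos."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def register(tenant: str, path: str) -> str | None:
    """Return the canonical path if `path` is a duplicate, else None.

    Assumed key schema: dedupe:{tenant}:{sha256} -> first-seen path.
    Prefixing by tenant keeps tenants' indexes disjoint; SET NX makes
    the first writer win, so overlapping workers cannot both claim a hash.
    """
    digest = file_sha256(path)
    key = f"dedupe:{tenant}:{digest}"
    # NX = only set if absent; atomic, so it doubles as a claim.
    if r.set(key, path, nx=True):
        return None               # we registered the canonical copy
    return r.get(key)             # someone registered this content first
```

On memory and eviction: a real deployment would keep the index under Redis's `noeviction` policy (silently evicting entries would make re-uploaded duplicates look like originals) and shard by hash prefix when one node's memory is insufficient.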
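For the batch scan, a minimal PySpark sketch, assuming a plain-text manifest of file paths and storage every executor can open (e.g., a shared mount or a FUSE view of object storage); the paths, partition count, and output location are placeholders. True data locality would require scheduling partitions onto the machines that own the files, which this sketch does not attempt.

```python
# Minimal PySpark sketch: hash every file in a manifest and group paths by
# digest. Manifest and output locations are illustrative assumptions.
import hashlib
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("photo-dedupe-scan").getOrCreate()
sc = spark.sparkContext

def hash_partition(paths):
    for path in paths:
        h = hashlib.sha256()
        try:
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            yield (h.hexdigest(), path)
        except OSError:
            # Skip unreadable files; a real pipeline would route these to
            # a dead-letter list instead of silently dropping them.
            pass

manifest = sc.textFile("hdfs:///dedupe/manifest.txt")    # one path per line
groups = (manifest
          .repartition(512)                    # spread I/O across executors
          .mapPartitions(hash_partition)
          .groupByKey()
          .mapValues(list)
          .filter(lambda kv: len(kv[1]) > 1))  # keep only duplicate sets
groups.saveAsTextFile("hdfs:///dedupe/duplicate-groups")
```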
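Before any deletion, re-verify bytes and make the delete reversible. The sketch below is one possible safety path, not the definitive one; the trash root, audit-log path, and JSON record shape are assumptions, and in production the audit log would live in a durable store rather than a local file.

```python
# Hedged sketch of the safety path: never trust the index alone. Re-verify
# bytes, then soft-delete by moving into a trash area and writing an audit
# record. All paths are illustrative assumptions.
import filecmp
import json
import os
import shutil
import time

TRASH_ROOT = "/var/dedupe/trash"        # assumed restore area
AUDIT_LOG = "/var/dedupe/audit.jsonl"   # assumed append-only audit log

def soft_delete_duplicate(dup_path: str, canonical_path: str) -> bool:
    # Byte-for-byte comparison guards against hash collisions and against
    # the file having changed since it was hashed.
    if not filecmp.cmp(dup_path, canonical_path, shallow=False):
        return False
    trash_path = os.path.join(TRASH_ROOT, dup_path.lstrip("/"))
    os.makedirs(os.path.dirname(trash_path), exist_ok=True)
    shutil.move(dup_path, trash_path)   # reversible: restore = move back
    with open(AUDIT_LOG, "a") as log:
        log.write(json.dumps({
            "ts": time.time(),
            "action": "soft_delete",
            "duplicate": dup_path,
            "canonical": canonical_path,
            "trash": trash_path,
        }) + "\n")
    return True
```

Canarying fits on top of this: run the soft-delete on a small, low-risk slice first, watch restore requests and error rates, then widen the rollout.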
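For fault tolerance and backpressure, two small generic patterns help: retry with exponential backoff plus jitter, and a bounded queue so producers block instead of overrunning a slow sink. The attempt count, delays, and queue size below are illustrative assumptions.

```python
# Sketch of worker-side reliability. A bounded queue gives backpressure for
# free: put() blocks when the sink falls behind. Limits are illustrative.
import queue
import random
import time

work_queue: "queue.Queue[str]" = queue.Queue(maxsize=10_000)

def with_retries(fn, *args, attempts: int = 5, base_delay: float = 0.5):
    """Retry transient failures with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn(*args)
        except (OSError, ConnectionError):
            if attempt == attempts - 1:
                raise
            # Full jitter keeps retry storms from synchronizing.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

def produce(paths):
    for path in paths:
        work_queue.put(path)   # blocks when full: upstream slows down
```

Monitoring would wrap these same choke points: count retries and dead-letters, time each stage for latency, and track queue depth as the leading backpressure signal.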
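Incremental runs can be as simple as a persisted watermark, as in the mtime-based sketch below. The watermark file location is an assumption, and an object store would more likely consume a change feed or event notifications than walk directories; periodic full re-hashing still runs the original batch scan.

```python
# Sketch of incremental scanning: persist a watermark (last run's start
# time) and only hash files modified after it. State location is assumed.
import os
import time

WATERMARK_FILE = "/var/dedupe/last_run"

def load_watermark() -> float:
    try:
        with open(WATERMARK_FILE) as f:
            return float(f.read())
    except (OSError, ValueError):
        return 0.0   # first run: scan everything

def new_files(root: str):
    since = load_watermark()
    started = time.time()
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > since:
                yield path
    # Record the start (not end) time so files written mid-scan are
    # re-examined next run rather than missed.
    with open(WATERMARK_FILE, "w") as f:
        f.write(str(started))
```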
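Finally, a back-of-envelope sizing calculation. Every input here (corpus size, average photo size, per-key overhead, worker throughput) is an assumption to be replaced with measured numbers; the point is the shape of the arithmetic, not the specific results.

```python
# Back-of-envelope sizing with all inputs as explicit assumptions.
PHOTOS = 50_000_000          # assumed corpus size
AVG_MB = 3                   # assumed average photo size
REDIS_BYTES_PER_KEY = 150    # rough per-entry cost: key + value + overhead

index_gb = PHOTOS * REDIS_BYTES_PER_KEY / 1e9
print(f"index memory ~ {index_gb:.1f} GB")   # ~7.5 GB: one shard suffices

WORKERS = 100
MBPS_PER_WORKER = 100        # assumed sustained read+hash throughput
total_tb = PHOTOS * AVG_MB / 1e6
hours = total_tb * 1e6 / (WORKERS * MBPS_PER_WORKER) / 3600
print(f"full scan of ~{total_tb:.0f} TB ~ {hours:.1f} hours")  # ~150 TB, ~4 h
```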