Design a scalable photo deduplication service
Company: Abnormal Security
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
How would you productionize the duplicate-photo removal algorithm for tens of millions of files across multiple machines and storage locations? Cover:
- choosing a distributed KV store (e.g., Redis) or alternatives for the dedupe index, including key schema, memory sizing, and eviction;
- a batch pipeline (e.g., MapReduce/Spark) for scanning and hashing at scale, with partitioning strategy and data locality;
- idempotency, exactly-once processing, and race conditions when workers overlap;
- verification before deletion, canarying, soft-delete/restore, and audit logs;
- fault tolerance, retries, backpressure, and monitoring (latency, throughput, error rates);
- incremental runs for newly added files and periodic re-hashing;
- security/permissions and multi-tenant isolation;
- cost and throughput estimates.
Quick Answer: This question evaluates competence in large-scale system design: distributed deduplication indexing, batch and incremental processing, concurrency correctness, safety and rollback mechanisms, reliability and observability, multi-tenant security, and cost-throughput capacity planning. A strong answer keeps a content-hash index in a sharded KV store, hashes files with a distributed batch pipeline, claims work idempotently so overlapping workers stay safe, verifies bytes and soft-deletes with an audit trail before reclaiming space, and runs incrementally from a watermark. The hedged sketches below illustrate these pieces.
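One way to anchor the dedupe-index discussion is a concrete key schema. The sketch below assumes Redis via redis-py; the `dedupe:{tenant}:{sha256}` layout, the tenant prefix, and the connection details are illustrative choices, not a prescribed schema. The atomic SET NX doubles as the idempotent claim the question asks about: when two workers race on the same content, exactly one wins.

```python
# Hedged sketch of a dedupe index in Redis. Key names, the tenant scheme,
# and connection details are illustrative assumptions.
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def file_sha256(path: str) -> str:
    """Stream the file so memory stays flat even for large photos."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def register(tenant: str, path: str) -> str | None:
    """Return the canonical path if `path` is a duplicate, else None.

    Assumed key schema: dedupe:{tenant}:{sha256} -> first-seen path.
    Prefixing by tenant keeps tenants' indexes disjoint; SET NX makes
    the first writer win, so overlapping workers cannot both claim a hash.
    """
    digest = file_sha256(path)
    key = f"dedupe:{tenant}:{digest}"
    # NX = only set if absent; atomic, so it doubles as a claim.
    if r.set(key, path, nx=True):
        return None               # we registered the canonical copy
    return r.get(key)             # someone registered this content first
```

On memory and eviction: a real deployment would keep the index under Redis's `noeviction` policy (silently evicting entries would make re-uploaded duplicates look like originals) and shard by hash prefix when one node's memory is insufficient.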
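For the batch scan, a minimal PySpark sketch, assuming a plain-text manifest of file paths and storage every executor can open (e.g., a shared mount or a FUSE view of object storage); the paths, partition count, and output location are placeholders. True data locality would require scheduling partitions onto the machines that own the files, which this sketch does not attempt.

```python
# Minimal PySpark sketch: hash every file in a manifest and group paths by
# digest. Manifest and output locations are illustrative assumptions.
import hashlib
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("photo-dedupe-scan").getOrCreate()
sc = spark.sparkContext

def hash_partition(paths):
    for path in paths:
        h = hashlib.sha256()
        try:
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            yield (h.hexdigest(), path)
        except OSError:
            # Skip unreadable files; a real pipeline would route these to
            # a dead-letter list instead of silently dropping them.
            pass

manifest = sc.textFile("hdfs:///dedupe/manifest.txt")    # one path per line
groups = (manifest
          .repartition(512)                    # spread I/O across executors
          .mapPartitions(hash_partition)
          .groupByKey()
          .mapValues(list)
          .filter(lambda kv: len(kv[1]) > 1))  # keep only duplicate sets
groups.saveAsTextFile("hdfs:///dedupe/duplicate-groups")
```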
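Before any deletion, re-verify bytes and make the delete reversible. The sketch below is one possible safety path, not the definitive one; the trash root, audit-log path, and JSON record shape are assumptions, and in production the audit log would live in a durable store rather than a local file.

```python
# Hedged sketch of the safety path: never trust the index alone. Re-verify
# bytes, then soft-delete by moving into a trash area and writing an audit
# record. All paths are illustrative assumptions.
import filecmp
import json
import os
import shutil
import time

TRASH_ROOT = "/var/dedupe/trash"        # assumed restore area
AUDIT_LOG = "/var/dedupe/audit.jsonl"   # assumed append-only audit log

def soft_delete_duplicate(dup_path: str, canonical_path: str) -> bool:
    # Byte-for-byte comparison guards against hash collisions and against
    # the file having changed since it was hashed.
    if not filecmp.cmp(dup_path, canonical_path, shallow=False):
        return False
    trash_path = os.path.join(TRASH_ROOT, dup_path.lstrip("/"))
    os.makedirs(os.path.dirname(trash_path), exist_ok=True)
    shutil.move(dup_path, trash_path)   # reversible: restore = move back
    with open(AUDIT_LOG, "a") as log:
        log.write(json.dumps({
            "ts": time.time(),
            "action": "soft_delete",
            "duplicate": dup_path,
            "canonical": canonical_path,
            "trash": trash_path,
        }) + "\n")
    return True
```

Canarying fits on top of this: run the soft-delete on a small, low-risk slice first, watch restore requests and error rates, then widen the rollout.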
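For fault tolerance and backpressure, two small generic patterns help: retry with exponential backoff plus jitter, and a bounded queue so producers block instead of overrunning a slow sink. The attempt count, delays, and queue size below are illustrative assumptions.

```python
# Sketch of worker-side reliability. A bounded queue gives backpressure for
# free: put() blocks when the sink falls behind. Limits are illustrative.
import queue
import random
import time

work_queue: "queue.Queue[str]" = queue.Queue(maxsize=10_000)

def with_retries(fn, *args, attempts: int = 5, base_delay: float = 0.5):
    """Retry transient failures with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn(*args)
        except (OSError, ConnectionError):
            if attempt == attempts - 1:
                raise
            # Full jitter keeps retry storms from synchronizing.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

def produce(paths):
    for path in paths:
        work_queue.put(path)   # blocks when full: upstream slows down
```

Monitoring would wrap these same choke points: count retries and dead-letters, time each stage for latency, and track queue depth as the leading backpressure signal.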
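Incremental runs can be as simple as a persisted watermark, as in the mtime-based sketch below. The watermark file location is an assumption, and an object store would more likely consume a change feed or event notifications than walk directories; periodic full re-hashing still runs the original batch scan.

```python
# Sketch of incremental scanning: persist a watermark (last run's start
# time) and only hash files modified after it. State location is assumed.
import os
import time

WATERMARK_FILE = "/var/dedupe/last_run"

def load_watermark() -> float:
    try:
        with open(WATERMARK_FILE) as f:
            return float(f.read())
    except (OSError, ValueError):
        return 0.0   # first run: scan everything

def new_files(root: str):
    since = load_watermark()
    started = time.time()
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > since:
                yield path
    # Record the start (not end) time so files written mid-scan are
    # re-examined next run rather than missed.
    with open(WATERMARK_FILE, "w") as f:
        f.write(str(started))
```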
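Finally, a back-of-envelope sizing calculation. Every input here (corpus size, average photo size, per-key overhead, worker throughput) is an assumption to be replaced with measured numbers; the point is the shape of the arithmetic, not the specific results.

```python
# Back-of-envelope sizing with all inputs as explicit assumptions.
PHOTOS = 50_000_000          # assumed corpus size
AVG_MB = 3                   # assumed average photo size
REDIS_BYTES_PER_KEY = 150    # rough per-entry cost: key + value + overhead

index_gb = PHOTOS * REDIS_BYTES_PER_KEY / 1e9
print(f"index memory ~ {index_gb:.1f} GB")   # ~7.5 GB: one shard suffices

WORKERS = 100
MBPS_PER_WORKER = 100        # assumed sustained read+hash throughput
total_tb = PHOTOS * AVG_MB / 1e6
hours = total_tb * 1e6 / (WORKERS * MBPS_PER_WORKER) / 3600
print(f"full scan of ~{total_tb:.0f} TB ~ {hours:.1f} hours")  # ~150 TB, ~4 h
```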