This question evaluates competence in large-scale system design, including distributed deduplication indexing, batch and incremental processing, concurrency correctness, safety and rollback mechanisms, reliability and observability, multi-tenant security, and cost-throughput capacity planning.
Design a production system that detects and removes duplicate or near-duplicate photos across tens of millions of files stored in multiple storage locations (e.g., object stores, NAS, user devices) and processed by multiple machines. Assume batch and incremental runs, multi-tenant data, and strict safety/observability requirements.
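As a starting point for the exact versus near-duplicate distinction in the prompt, the following is a minimal sketch, not a reference design. It assumes exact duplicates are found by hashing file bytes with SHA-256, and that a 64-bit perceptual hash per photo has already been computed by some other component (not shown). The names `content_key`, `exact_groups`, `near_duplicates`, and the `max_distance` threshold are hypothetical. Keys are scoped by tenant so that matches never cross tenant boundaries.

```python
import hashlib
from collections import defaultdict


def content_key(tenant_id: str, data: bytes) -> tuple[str, str]:
    """Exact-duplicate key; including the tenant keeps matches tenant-local."""
    return tenant_id, hashlib.sha256(data).hexdigest()


def exact_groups(files: list[tuple[str, str, bytes]]) -> list[list[str]]:
    """Group (tenant_id, path, data) records whose bytes are identical."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for tenant_id, path, data in files:
        groups[content_key(tenant_id, data)].append(path)
    # Only groups with more than one member are duplicate candidates.
    return [paths for paths in groups.values() if len(paths) > 1]


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def near_duplicates(hashes: dict[str, int], max_distance: int = 5) -> list[tuple[str, str]]:
    """Pairs of photo ids whose (assumed precomputed) perceptual hashes are close.

    This is an O(n^2) comparison for illustration only; at tens of millions
    of files a real system would need an index instead of all-pairs checks.
    """
    ids = sorted(hashes)
    return [
        (x, y)
        for i, x in enumerate(ids)
        for y in ids[i + 1:]
        if hamming(hashes[x], hashes[y]) <= max_distance
    ]
```

The sketch only reports candidates. Given the safety requirement in the prompt, deletion would be a separate, reversible step.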
Cover the following decisions and trade-offs: