System Design: Production-Ready File Deduplication Service
Context
Design a multi-tenant cloud service that stores files and achieves space savings via deduplication. The service must handle large scale (billions of files, petabytes of data), support high availability across regions, and provide operationally safe mechanisms for change management.
Assume:

- Files are immutable once written (new versions create new files/manifests).
- Deduplication happens at the chunk level using content-defined chunking (a minimal chunking sketch follows this list).
- Storage is an object store; metadata and indexes use scalable data stores.
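To make the chunking assumption concrete, here is a minimal sketch of content-defined chunking with a gear-style rolling hash. The chunk-size bounds, the boundary mask, and the seeded gear table are illustrative assumptions, not requirements of the design.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

// gearTable drives the rolling hash; it is seeded deterministically so
// chunk boundaries (and therefore dedup hits) are stable across processes.
var gearTable = func() [256]uint64 {
	var t [256]uint64
	r := rand.New(rand.NewSource(42))
	for i := range t {
		t[i] = r.Uint64()
	}
	return t
}()

const (
	minChunk = 2 << 10  // 2 KiB: never cut a chunk shorter than this
	avgChunk = 8 << 10  // ~8 KiB target average chunk size
	maxChunk = 64 << 10 // 64 KiB: force a cut by this point
	mask     = avgChunk - 1
)

// Chunk is one content-defined piece of a file, addressed by its hash.
type Chunk struct {
	Hash [32]byte
	Data []byte
}

// chunkBytes splits data at content-defined boundaries using a gear-style
// rolling hash: a boundary falls wherever the low bits of the hash are all
// zero, so an edit early in a file only moves nearby boundaries instead of
// shifting every chunk after it (which is what makes dedup survive edits).
func chunkBytes(data []byte) []Chunk {
	var chunks []Chunk
	start := 0
	var h uint64
	for i, b := range data {
		h = (h << 1) + gearTable[b]
		n := i - start + 1
		atBoundary := n >= minChunk && h&mask == 0
		if atBoundary || n >= maxChunk || i == len(data)-1 {
			piece := data[start : i+1]
			chunks = append(chunks, Chunk{Hash: sha256.Sum256(piece), Data: piece})
			start = i + 1
			h = 0
		}
	}
	return chunks
}

func main() {
	data := make([]byte, 100<<10)
	rand.Read(data)
	for _, c := range chunkBytes(data) {
		fmt.Printf("%x  %d bytes\n", c.Hash[:6], len(c.Data))
	}
}
```

Because boundaries depend on content rather than byte offsets, two versions of a file that differ by a small edit still share most of their chunks.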
Requirements
Outline the following:
- Architecture
  - Ingest, chunking, indexing, storage, and metadata layers.
  - How manifests map files to chunks (a manifest sketch follows this list).
- APIs
  - Read/write, streaming/multipart, idempotency, and admin/maintenance endpoints (an idempotent-endpoint sketch follows this list).
- Read/Write Workflows
  - End-to-end flows, including parallelism and error handling (a parallel-upload sketch follows this list).
- Consistency and Safety
  - Consistency model, idempotency strategies, fault isolation, failure recovery, and disaster recovery (RPO/RTO targets).
- Maintenance Operations
  - Backfills, compaction and garbage collection (a GC sketch follows this list), index sharding and rebalancing, and safe rollout/rollback, including schema/version migrations.
- Operations and SRE
  - Monitoring, alerting, and SLOs; capacity planning and cost controls (compute, storage, network).
- Privacy and Compliance
  - Encryption, access control, GDPR/data deletion, residency, and auditing.
- Minimizing Production Impact
  - Rate limiting, backpressure, priority queues, and other isolation mechanisms (a token-bucket sketch follows this list).
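For the Architecture item above, here is one plausible shape for the manifest that maps a file to its chunks. All field names are assumptions for illustration, not a prescribed schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChunkRef points at one deduplicated chunk in the object store.
// Offset and Length let readers reassemble the file, or serve range
// reads without fetching chunks they do not need.
type ChunkRef struct {
	Hash   string `json:"hash"`   // content hash; doubles as the object-store key
	Offset int64  `json:"offset"` // byte offset within the logical file
	Length int64  `json:"length"` // chunk size in bytes
}

// Manifest is the immutable per-file record: a new version of a file
// gets a new manifest, while unchanged chunks are shared by reference.
type Manifest struct {
	FileID  string     `json:"file_id"`
	Tenant  string     `json:"tenant"`
	Version int64      `json:"version"`
	Size    int64      `json:"size"`
	Chunks  []ChunkRef `json:"chunks"`
}

func main() {
	m := Manifest{
		FileID:  "f-123",
		Tenant:  "acme",
		Version: 1,
		Size:    12288,
		Chunks: []ChunkRef{
			{Hash: "sha256:placeholder-1", Offset: 0, Length: 8192},
			{Hash: "sha256:placeholder-2", Offset: 8192, Length: 4096},
		},
	}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out))
}
```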
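For the APIs and Consistency items, a sketch of how a write endpoint can honor client-supplied idempotency keys so retries never duplicate side effects. The route, header name, and in-memory store are assumptions; a production service would persist keys durably with a TTL.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// idempotencyStore remembers completed writes by client-supplied key so a
// retried PUT replays the original result instead of re-running side effects.
type idempotencyStore struct {
	mu   sync.Mutex
	done map[string]string // idempotency key -> manifest ID already returned
}

func (s *idempotencyStore) lookup(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	id, ok := s.done[key]
	return id, ok
}

func (s *idempotencyStore) record(key, manifestID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.done[key] = manifestID
}

func main() {
	store := &idempotencyStore{done: map[string]string{}}
	http.HandleFunc("/v1/files", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPut {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		key := r.Header.Get("Idempotency-Key")
		if key == "" {
			http.Error(w, "Idempotency-Key header required", http.StatusBadRequest)
			return
		}
		if id, seen := store.lookup(key); seen {
			// Retry of a completed request: replay the response, do no work.
			fmt.Fprintf(w, `{"manifest_id":%q,"replayed":true}`, id)
			return
		}
		// ... chunk the body, upload missing chunks, commit the manifest ...
		id := "m-" + key // placeholder; a real handler returns the committed manifest ID
		store.record(key, id)
		w.WriteHeader(http.StatusCreated)
		fmt.Fprintf(w, `{"manifest_id":%q}`, id)
	})
	_ = http.ListenAndServe(":8080", nil)
}
```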
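For the Read/Write Workflows item, a sketch of the write path from the client side: probe the index for each chunk hash, upload only the misses with bounded parallelism, and commit the manifest only after every chunk is durable. The `ChunkStore` interface and its `Has`/`Put` methods are assumptions for this sketch.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// ChunkStore abstracts the chunk index plus object store; Has lets the
// writer skip uploading chunks the service already holds (the dedup fast path).
type ChunkStore interface {
	Has(ctx context.Context, hash string) (bool, error)
	Put(ctx context.Context, hash string, data []byte) error
}

// uploadChunks pushes missing chunks with bounded parallelism. The caller
// commits the manifest only after this returns nil, so a crash mid-upload
// leaves orphan chunks (reclaimed later by GC) but never a manifest that
// points at missing data.
func uploadChunks(ctx context.Context, store ChunkStore, chunks map[string][]byte, workers int) error {
	sem := make(chan struct{}, workers) // bounds in-flight uploads
	var wg sync.WaitGroup
	errs := make(chan error, len(chunks))
	for hash, data := range chunks {
		wg.Add(1)
		sem <- struct{}{}
		go func(hash string, data []byte) {
			defer wg.Done()
			defer func() { <-sem }()
			ok, err := store.Has(ctx, hash)
			if err != nil {
				errs <- err
				return
			}
			if ok {
				return // dedup hit: chunk already stored, no bytes moved
			}
			if err := store.Put(ctx, hash, data); err != nil {
				errs <- fmt.Errorf("put %s: %w", hash, err)
			}
		}(hash, data)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		return err // surface the first failure; content-addressed puts retry safely
	}
	return nil
}

// memStore is a toy in-memory ChunkStore so the sketch runs standalone.
type memStore struct {
	mu sync.Mutex
	m  map[string][]byte
}

func (s *memStore) Has(_ context.Context, h string) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.m[h]
	return ok, nil
}

func (s *memStore) Put(_ context.Context, h string, d []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[h] = d
	return nil
}

func main() {
	store := &memStore{m: map[string][]byte{"h1": []byte("already stored")}}
	chunks := map[string][]byte{"h1": []byte("already stored"), "h2": []byte("new data")}
	if err := uploadChunks(context.Background(), store, chunks, 4); err != nil {
		fmt.Println("upload failed:", err)
		return
	}
	fmt.Println("all chunks durable; safe to commit the manifest")
}
```

Ordering writes this way (chunks first, manifest last) is what keeps partial failures safe: the worst case is wasted space, never a dangling reference.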
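For the garbage-collection part of Maintenance Operations, a sketch of a mark-and-sweep pass with a grace window, so chunks written by an in-flight upload whose manifest has not yet committed are never collected. The 24-hour window and the field names are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// ChunkMeta is what the GC tracks per stored chunk. LastSeenUnreferenced
// implements the grace period: a chunk must stay unreferenced across sweeps
// spanning graceWindow before it becomes deletable.
type ChunkMeta struct {
	Hash                 string
	LastSeenUnreferenced time.Time // zero value means "referenced at last sweep"
}

const graceWindow = 24 * time.Hour // assumed safety margin for in-flight writes

// sweep marks chunks reachable from live manifests, then returns the hashes
// that have been continuously unreferenced for longer than the grace window.
func sweep(now time.Time, chunks map[string]*ChunkMeta, liveRefs map[string]bool) (deletable []string) {
	for hash, meta := range chunks {
		if liveRefs[hash] {
			meta.LastSeenUnreferenced = time.Time{} // referenced again: reset the clock
			continue
		}
		if meta.LastSeenUnreferenced.IsZero() {
			meta.LastSeenUnreferenced = now // first sweep to see it orphaned
			continue
		}
		if now.Sub(meta.LastSeenUnreferenced) >= graceWindow {
			deletable = append(deletable, hash)
		}
	}
	return deletable
}

func main() {
	now := time.Now()
	chunks := map[string]*ChunkMeta{
		"live":   {Hash: "live"},
		"orphan": {Hash: "orphan", LastSeenUnreferenced: now.Add(-48 * time.Hour)},
		"fresh":  {Hash: "fresh", LastSeenUnreferenced: now.Add(-1 * time.Hour)},
	}
	liveRefs := map[string]bool{"live": true}
	fmt.Println("deletable:", sweep(now, chunks, liveRefs)) // only "orphan"
}
```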
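For Minimizing Production Impact, a per-tenant token bucket is one common rate-limiting primitive; requests denied a token can be rejected with a retry hint or shifted to a lower-priority queue. The rates below are illustrative.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenBucket is a simple rate limiter: Allow is cheap enough to sit on the
// hot path, keeping backpressure at the edge instead of overloading backends.
type tokenBucket struct {
	mu     sync.Mutex
	tokens float64
	max    float64   // burst capacity
	rate   float64   // tokens refilled per second
	last   time.Time // last refill timestamp
}

func newTokenBucket(rate, burst float64) *tokenBucket {
	return &tokenBucket{tokens: burst, max: burst, rate: rate, last: time.Now()}
}

// Allow refills the bucket based on elapsed time, then consumes one token
// if available; false tells the caller to shed or queue the request.
func (b *tokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.max {
		b.tokens = b.max
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	limiter := newTokenBucket(5, 2) // 5 req/s sustained, burst of 2 (illustrative)
	for i := 0; i < 4; i++ {
		fmt.Printf("request %d allowed=%v\n", i, limiter.Allow())
		time.Sleep(100 * time.Millisecond)
	}
}
```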