Design a replicated cloud storage service
Company: Amazon
Role: Software Engineer
Category: System Design
Difficulty: Hard
Interview Round: Technical Screen
Design the internals of a cloud storage service (object/blob storage). Focus on storage and infrastructure concerns rather than end-user features.
Cover the following:
1. **High-level architecture**
- Separate the **data plane** (serving blob reads/writes) from the **control/metadata plane** (namespaces, object locations, versions, ACLs).
- Key services/components you would expect (frontends, metadata service, storage nodes, background repair, monitoring).
2. **Metadata vs data relationship**
- What metadata is stored (object ID, size, checksums, replication state, versioning, location pointers).
- How metadata points to data chunks/segments and how you avoid metadata becoming a bottleneck.
3. **Replication model**
- Choose a replication approach (e.g., primary/secondary, quorum replication, chain replication, erasure coding) and justify it.
- Define durability and availability goals (e.g., tolerate N failures) and what “commit” means.
4. **Write path and read path**
- Step-by-step request flow for a PUT/WRITE and GET/READ.
- When you acknowledge a write to the client.
- Caching and hot-object optimizations (optional).
5. **Trade-offs**
- How you balance **durability vs performance** (sync vs async replication, quorum size, batching).
- Consistency choices (strong/eventual) and how clients observe them.
6. **Failure handling and recovery**
- Node failure detection, re-replication/reconstruction, data scrubbing, and recovery workflow.
- What happens during partial failures (e.g., one replica is slow, or the metadata service is unavailable).
Quick Answer: This question evaluates a candidate's system-design competency in distributed cloud storage, covering metadata/data separation, replication models, read/write paths, failure recovery, and operational trade-offs.