Design an S3-like object storage service
Company: Amazon
Role: Machine Learning Engineer
Category: System Design
Difficulty: Medium
Interview Round: Onsite
Design a cloud object storage service similar to Amazon S3. The service should allow clients to upload, store, and download large files reliably and efficiently.
Focus your design on the following aspects:
1. **API Design**
- Define high-level REST APIs for:
- Uploading an object (e.g., `PUT /buckets/{bucketId}/objects/{objectKey}`)
- Downloading an object (e.g., `GET /buckets/{bucketId}/objects/{objectKey}`)
- Optionally listing objects in a bucket.
- Consider authentication, basic metadata handling (e.g., size, content-type), and how clients reference objects (buckets and keys).
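To make the API expectations concrete, here is a minimal in-memory sketch of the PUT/GET/LIST semantics. The `ObjectStore` class, the bearer-token check, and the MD5-based ETag are illustrative assumptions, not a prescribed design.

```python
import hashlib


class ObjectStore:
    """Toy in-memory backend illustrating PUT/GET/LIST object semantics."""

    def __init__(self, auth_tokens):
        self._auth = set(auth_tokens)   # assumed bearer-token authentication
        self._buckets = {}              # bucketId -> {objectKey: (data, metadata)}

    def put_object(self, token, bucket, key, data, content_type):
        """PUT /buckets/{bucketId}/objects/{objectKey}"""
        if token not in self._auth:
            return 403, None
        meta = {
            "size": len(data),
            "content-type": content_type,
            "etag": hashlib.md5(data).hexdigest(),  # checksum echoed back to the client
        }
        self._buckets.setdefault(bucket, {})[key] = (data, meta)
        return 200, meta

    def get_object(self, token, bucket, key):
        """GET /buckets/{bucketId}/objects/{objectKey}"""
        if token not in self._auth:
            return 403, None, None
        obj = self._buckets.get(bucket, {}).get(key)
        if obj is None:
            return 404, None, None
        data, meta = obj
        return 200, data, meta

    def list_objects(self, token, bucket):
        """GET /buckets/{bucketId}/objects (optional listing)"""
        if token not in self._auth:
            return 403, None
        return 200, sorted(self._buckets.get(bucket, {}).keys())
```

In an interview you would map these methods to HTTP handlers behind a load balancer; the in-memory dict stands in for the metadata and data layers discussed below.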
2. **File Splitting / Multipart Upload**
- Large files (e.g., several GBs) should be uploadable in parts.
- Explain how you would:
- Split files into chunks/parts on the client or server.
- Track upload progress and handle retries for failed parts.
- Reassemble parts into a final object.
- Discuss trade-offs in chunk size and how to ensure consistency and integrity (e.g., checksums).
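The split/retry/reassemble flow above can be sketched as follows. The part size, the per-part MD5 checksums, and the `send` callback are assumptions chosen for illustration; real services negotiate part sizes and use a commit step to finalize the object.

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # assumed part size; real systems tune this trade-off


def split_into_parts(data, chunk_size=CHUNK_SIZE):
    """Split a byte payload into numbered parts, each with its own checksum."""
    parts = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        parts.append({
            "part_number": i // chunk_size + 1,
            "data": chunk,
            "md5": hashlib.md5(chunk).hexdigest(),
        })
    return parts


def upload_with_retries(parts, send, max_retries=3):
    """Upload each part, retrying on transient failure or checksum mismatch."""
    completed = {}
    for part in parts:
        for _ in range(max_retries):
            ok, server_md5 = send(part)
            if ok and server_md5 == part["md5"]:
                completed[part["part_number"]] = server_md5
                break
        else:
            raise RuntimeError(f"part {part['part_number']} failed after {max_retries} tries")
    return completed


def assemble(parts):
    """Reassemble parts in part-number order into the final object."""
    ordered = sorted(parts, key=lambda p: p["part_number"])
    return b"".join(p["data"] for p in ordered)
```

The `completed` map doubles as upload-progress state: a client that crashes mid-upload can resume by re-sending only the parts missing from it.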
3. **Backend Storage and Replication**
- Design how the service stores object data and metadata:
- Object data storage layer (e.g., distributed file system or key-value storage).
- Metadata storage (e.g., mapping from bucket/key to physical locations, size, checksums, replication info).
- Explain how you would replicate data across multiple machines and data centers to handle:
- Machine failures.
- Data center outages.
- Describe strategies for:
- Data durability (e.g., replication factor, erasure coding).
- Consistency model (eventual vs strong) for reads after writes.
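Two of the points above can be made concrete with a small sketch: deterministic replica placement, and the quorum rule `W + R > N` that gives read-after-write consistency. The modular placement scheme below is an illustrative assumption; production systems typically use consistent hashing or a dedicated placement service.

```python
import hashlib


def replica_nodes(bucket, key, nodes, replication_factor=3):
    """Pick replication_factor distinct nodes for an object via hashing.

    Hashing the bucket/key gives deterministic placement, so any frontend
    can locate an object's replicas without a directory lookup.
    """
    h = int(hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest(), 16)
    start = h % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def is_strongly_consistent(n, w, r):
    """Quorum rule: with N replicas, W write acks and R read acks,
    W + R > N guarantees every read quorum overlaps the latest write quorum."""
    return w + r > n
```

For example, `N=3, W=2, R=2` yields strong read-after-write consistency, while `N=3, W=1, R=1` is only eventually consistent but has lower latency.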
4. **Failure Handling and Disaster Recovery**
- Describe what happens if a data center goes down:
- How does the system continue serving reads and writes?
- How do you detect failures and route traffic to healthy regions?
- Discuss backup, restore, and how you ensure no data loss (or minimal data loss) in catastrophic failures.
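Failure detection and traffic routing can be sketched with a heartbeat-based router. The region names, the 10-second staleness threshold, and the `RegionRouter` class are illustrative assumptions; real systems layer this on DNS failover or anycast plus health-checking load balancers.

```python
import time


class RegionRouter:
    """Route requests to healthy regions based on heartbeat recency."""

    def __init__(self, regions, timeout_s=10.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = {r: 0.0 for r in regions}

    def heartbeat(self, region, now=None):
        """Record a liveness signal from a region's health checker."""
        self.last_heartbeat[region] = time.monotonic() if now is None else now

    def healthy_regions(self, now=None):
        now = time.monotonic() if now is None else now
        return [r for r, t in self.last_heartbeat.items()
                if now - t < self.timeout_s]

    def route(self, preferred, now=None):
        """Prefer the client's home region; fail over to any healthy region."""
        healthy = self.healthy_regions(now)
        if preferred in healthy:
            return preferred
        if healthy:
            return healthy[0]
        raise RuntimeError("no healthy regions available")
```

A stale heartbeat marks a region down, after which writes land in a surviving region and are re-replicated back once the failed region recovers.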
5. **Scalability and Performance**
- How would you design the system to handle:
- Many concurrent uploads/downloads (e.g., millions of QPS)?
- Large total storage size (e.g., petabytes or more)?
- Explain choices like partitioning/sharding keys, load balancing, and caching.
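One common partitioning choice is worth illustrating: hashing the full bucket/key before sharding, so that sequential keys (timestamps, counters) spread across shards instead of hammering one hot partition. The shard count of 64 is an assumption; real systems split and merge shards dynamically.

```python
import hashlib

NUM_SHARDS = 64  # assumed fixed shard count for illustration


def shard_for(bucket, key, num_shards=NUM_SHARDS):
    """Map an object to a metadata shard by hashing its full path.

    Range-partitioning raw keys would send timestamp-prefixed uploads to a
    single hot shard; hashing spreads the load at the cost of efficient
    range scans (listing then requires a scatter-gather across shards).
    """
    digest = hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest()
    return int(digest, 16) % num_shards
```

The same trade-off drives cache design: hot objects can be served from a CDN or an in-memory cache keyed by the hashed path, keeping the storage tier off the read hot path.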
Clearly state assumptions (e.g., target QPS, typical object sizes, durability requirements) and walk through the end-to-end flow of a typical upload and download request.
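A back-of-envelope sizing is a good way to state those assumptions explicitly. All the numbers below (10 PiB logical data, 1 MiB average object, replication factor 3, 16 TiB usable per node) are illustrative assumptions, not targets from the prompt.

```python
# Back-of-envelope capacity sizing under illustrative assumptions.
LOGICAL_BYTES = 10 * 1024**5       # 10 PiB of logical (pre-replication) data
AVG_OBJECT_BYTES = 1 * 1024**2     # 1 MiB average object size
REPLICATION_FACTOR = 3             # three full replicas (no erasure coding)
DISK_BYTES_PER_NODE = 16 * 1024**4 # 16 TiB usable per storage node

object_count = LOGICAL_BYTES // AVG_OBJECT_BYTES       # ~1.07e10 objects
raw_bytes = LOGICAL_BYTES * REPLICATION_FACTOR         # 30 PiB on disk
nodes_needed = -(-raw_bytes // DISK_BYTES_PER_NODE)    # ceiling division
```

Roughly 10 billion objects and about two thousand storage nodes; erasure coding (e.g., a 1.5x overhead instead of 3x) would halve the node count at the cost of more complex reads and repairs.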
Quick Answer: This question evaluates your understanding of distributed systems, object storage architecture, REST API design, multipart upload semantics, replication and durability strategies, failure recovery, and scalability.