Design a cloud object storage service similar to Amazon S3. The service should allow clients to upload, store, and download large files reliably and efficiently.
Focus your design on the following aspects:
- API Design
  - Define high-level REST APIs for:
    - Uploading an object (e.g., `PUT /buckets/{bucketId}/objects/{objectKey}`)
    - Downloading an object (e.g., `GET /buckets/{bucketId}/objects/{objectKey}`)
    - Optionally listing objects in a bucket.
  - Consider authentication, basic metadata handling (e.g., size, content-type), and how clients reference objects (buckets and keys).
- File Splitting / Multipart Upload
  - Large files (e.g., several GBs) should be uploadable in parts.
  - Explain how you would:
    - Split files into chunks/parts on the client or server.
    - Track upload progress and handle retries for failed parts.
    - Reassemble parts into a final object.
  - Discuss trade-offs in chunk size and how to ensure consistency and integrity (e.g., checksums).
- Backend Storage and Replication
  - Design how the service stores object data and metadata:
    - Object data storage layer (e.g., distributed file system or key-value storage).
    - Metadata storage (e.g., mapping from bucket/key to physical locations, size, checksums, replication info).
  - Explain how you will replicate data across multiple machines and data centers to handle:
    - Machine failures.
    - Data center outages.
  - Describe strategies for:
    - Data durability (e.g., replication factor, erasure coding).
    - Consistency model (eventual vs strong) for reads after writes.
- Failure Handling and Disaster Recovery
  - Describe what happens if a data center goes down:
    - How does the system continue serving reads and writes?
    - How do you detect failures and route traffic to healthy regions?
  - Discuss backup, restore, and how you ensure no data loss (or minimal data loss) in catastrophic failures.
- Scalability and Performance
  - How would you design the system to handle:
    - Many concurrent uploads/downloads (e.g., millions of QPS)?
    - Large total storage size (e.g., petabytes or more)?
  - Explain choices like partitioning/sharding keys, load balancing, and caching.
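
As one illustration of the partitioning/sharding discussion above, the sketch below places objects on storage nodes with a consistent hash ring. It is a minimal example under stated assumptions, not a prescribed answer: the `HashRing` class, the node names, the 64 virtual nodes per machine, and the default of 3 replicas are all hypothetical choices made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node appears `vnodes` times to smooth out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def nodes_for(self, bucket_id, object_key, replicas=3):
        """Return up to `replicas` distinct nodes, walking clockwise from the key."""
        key_pos = _hash(f"{bucket_id}/{object_key}")
        start = bisect.bisect_right(self._positions, key_pos)
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == replicas:
                    break
        return chosen


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.nodes_for("photos", "2024/cat.jpg"))
```

With this scheme, adding or removing a node moves only the keys adjacent to that node's ring positions. A real design would also need to make the replica choice aware of racks and data centers, which this sketch ignores.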
Clearly state assumptions (e.g., target QPS, typical object sizes, durability requirements) and walk through the end-to-end flow of a typical upload and download request.
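
To make the upload flow concrete, here is a minimal in-memory sketch of a multipart upload: the client splits the data into parts, attaches a SHA-256 checksum to each, retries parts that fail with a transient error, and the server reassembles the parts in order. Every name here (`split_into_parts`, `InMemoryUpload`, `upload`) and the 8 MiB part size are assumptions made for illustration. A real service would persist parts and upload state durably rather than in a dictionary.

```python
import hashlib

PART_SIZE = 8 * 1024 * 1024  # assumed part size of 8 MiB


def split_into_parts(data: bytes, part_size: int = PART_SIZE):
    """Yield (part_number, chunk, sha256_hex); part numbers start at 1."""
    for number, offset in enumerate(range(0, len(data), part_size), start=1):
        chunk = data[offset:offset + part_size]
        yield number, chunk, hashlib.sha256(chunk).hexdigest()


class InMemoryUpload:
    """Hypothetical server-side state for a single multipart upload."""

    def __init__(self):
        self.parts = {}  # part_number -> bytes

    def put_part(self, number, chunk, checksum):
        # Reject a part whose content does not match the client's checksum.
        if hashlib.sha256(chunk).hexdigest() != checksum:
            raise ValueError(f"checksum mismatch for part {number}")
        # Idempotent: retrying a part overwrites the same slot.
        self.parts[number] = chunk

    def complete(self, expected_part_count):
        missing = [n for n in range(1, expected_part_count + 1)
                   if n not in self.parts]
        if missing:
            raise ValueError(f"missing parts: {missing}")
        return b"".join(self.parts[n] for n in range(1, expected_part_count + 1))


def upload(data, server, part_size=PART_SIZE, max_attempts=3):
    """Upload every part, retrying transient failures, then reassemble."""
    count = 0
    for number, chunk, checksum in split_into_parts(data, part_size):
        for attempt in range(1, max_attempts + 1):
            try:
                server.put_part(number, chunk, checksum)
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise
        count = number
    return server.complete(count)


payload = b"example object bytes" * 1000
assert upload(payload, InMemoryUpload(), part_size=4096) == payload
```

Because each part is keyed by its number and verified independently, a failed part can be retried without resending the rest. The part size is a trade-off: smaller parts make retries cheaper but add per-request overhead and more metadata to track.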