
Design a replicated cloud storage service

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's system-design competency in distributed cloud storage, covering metadata/data separation, replication models, read/write paths, failure recovery, and operational trade-offs.

Company: Amazon

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

Jan 22, 2026, 12:00 AM
Design the internals of a cloud storage service (object/blob storage). Focus on storage/infra concerns rather than end-user features.

Cover the following:

  1. High-level architecture
    • Separate data plane (serving reads/writes of blobs) vs control/metadata plane (namespaces, object locations, versions, ACLs).
    • Key services/components you would expect (frontends, metadata service, storage nodes, background repair, monitoring).
  2. Metadata vs data relationship
    • What metadata is stored (object ID, size, checksums, replication state, versioning, location pointers).
    • How metadata points to data chunks/segments and how you avoid metadata becoming a bottleneck.
  3. Replication model
    • Choose a replication approach (e.g., primary/secondary, quorum replication, chain replication, erasure coding) and justify it.
    • Define durability and availability goals (e.g., tolerate N failures) and what “commit” means.
  4. Write path and read path
    • Step-by-step request flow for a PUT/WRITE and GET/READ.
    • When you acknowledge a write to the client.
    • Caching and hot-object optimizations (optional).
  5. Trade-offs
    • How you balance durability vs performance (sync vs async replication, quorum size, batching).
    • Consistency choices (strong/eventual) and how clients observe them.
  6. Failure handling and recovery
    • Node failure detection, re-replication/reconstruction, data scrubbing, and recovery workflow.
    • What happens during partial failures (e.g., one replica slow, metadata unavailable).
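
To make the prompts on metadata, commit, and acknowledgement more concrete, here is a minimal in-memory sketch of one possible answer: full-copy replication with a write quorum. It is an illustration under stated assumptions, not a reference design. The class names (StorageNode, ObjectMetadata, QuorumStore), the write_quorum parameter, and the "object_id@vN" chunk naming are all hypothetical. In this sketch, "commit" means the metadata record is published only after at least write_quorum nodes have stored the data, and that is also the point at which the client is acknowledged.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Hypothetical in-memory stand-in for a data-plane storage node."""
    name: str
    healthy: bool = True
    chunks: dict = field(default_factory=dict)

    def put(self, chunk_id: str, data: bytes) -> bool:
        if not self.healthy:
            return False  # simulate a failed or unreachable node
        self.chunks[chunk_id] = data
        return True

    def get(self, chunk_id: str):
        if not self.healthy:
            return None
        return self.chunks.get(chunk_id)


@dataclass
class ObjectMetadata:
    """Metadata record using fields named in the prompt (ID, size, checksum, version, locations)."""
    object_id: str
    size: int
    checksum: str
    version: int
    locations: list  # names of nodes that acknowledged this version


class QuorumStore:
    """Assumed design: write to all replicas, commit once write_quorum of them ack."""

    def __init__(self, nodes, write_quorum: int):
        self.nodes = nodes
        self.write_quorum = write_quorum
        self.metadata = {}  # object_id -> ObjectMetadata (the metadata plane)

    def put(self, object_id: str, data: bytes) -> bool:
        prev = self.metadata.get(object_id)
        version = prev.version + 1 if prev else 1
        chunk_id = f"{object_id}@v{version}"
        acked = [n.name for n in self.nodes if n.put(chunk_id, data)]
        if len(acked) < self.write_quorum:
            # Not committed: metadata still points at the previous version.
            # Cleanup of the orphaned chunk is left out of this sketch.
            return False
        checksum = hashlib.sha256(data).hexdigest()
        self.metadata[object_id] = ObjectMetadata(
            object_id, len(data), checksum, version, acked
        )
        return True  # the client is acknowledged only here

    def get(self, object_id: str):
        meta = self.metadata.get(object_id)
        if meta is None:
            return None
        chunk_id = f"{object_id}@v{meta.version}"
        by_name = {n.name: n for n in self.nodes}
        for name in meta.locations:
            data = by_name[name].get(chunk_id)
            # Verify the checksum so a corrupt replica is skipped, not served.
            if data is not None and hashlib.sha256(data).hexdigest() == meta.checksum:
                return data
        return None
```

With three nodes and write_quorum=2, a write still commits when one node is down, and a read falls through to the next listed replica if the first is unavailable or fails its checksum. Raising write_quorum trades write availability and latency for durability, which is the trade-off item 5 asks about.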
