Cloud File Storage Service

What's being tested

This tests whether you can design a production-grade distributed storage system with clean API boundaries, durable object storage, strongly modeled metadata, and well-scoped consistency guarantees. The interviewer is probing for how you separate file bytes from metadata, handle concurrency for operations like rename and move, support large resumable uploads, and reason about failures without hand-waving. Harvey cares because legal and enterprise workflows depend on correctness around documents: losing a file, exposing the wrong version, or corrupting folder state is unacceptable. A strong Software Engineer answer should show practical distributed-systems judgment: which components need strong consistency, which can be eventually consistent, and where to use transactions, idempotency, and background repair.

Core knowledge

Storage/metadata separation is the starting architecture: store file contents as immutable blobs in `S3`, `GCS`, or an internal object store, and store directory trees, permissions, versions, and upload state in a transactional metadata service like `Postgres`, `MySQL`, `Spanner`, or `DynamoDB`.
Object keys should not be user-visible paths. Use stable IDs such as `file_id`, `version_id`, and `blob_id`; a path such as `/client/matter/doc.pdf` is metadata. This prevents rename from requiring object rewrites and lets the same blob be referenced by multiple versions or deduplicated.
Directory modeling usually uses either an adjacency list table, with `parent_id` and `name`, or a materialized path / closure table for faster subtree queries. Adjacency lists are simple and transactional; materialized paths make subtree listing easier but complicate renames and moves.
Uniqueness constraints are essential for correctness. A typical table has `UNIQUE(parent_id, name)` to prevent two entries with the same name in one folder. For case-insensitive filesystems, normalize names into `normalized_name` and enforce `UNIQUE(parent_id, normalized_name)`.
Atomic rename/move should be a metadata transaction, not a blob operation. Update `parent_id`, `name`, `updated_at`, and any cached path inside one database transaction. Before commit, check permissions, confirm the destination folder exists, enforce name uniqueness in the destination, and reject moves that would create a cycle, such as moving a folder into its own descendant.
Per-directory limits need concurrency-safe enforcement. If a folder has max N children, do not read `COUNT(*)` and then insert without protection. Use a locked counter row, `SELECT ... FOR UPDATE`, serializable isolation, or a conditional update like `UPDATE folders SET child_count = child_count + 1 WHERE id = ? AND child_count < N`.
Transaction boundaries should keep external object storage out of database transactions. Common flow: create upload session in metadata, upload bytes to object storage, verify checksum, then commit a file version referencing the blob. Use a garbage collector for orphaned uploaded blobs.
Idempotency keys protect create/upload/commit APIs from client retries. Store `(user_id, operation, idempotency_key) -> response` with a TTL or durable record. Stripe’s idempotency-key pattern is useful: the same key and parameters return the same result; parameter mismatch returns an error.
Large-file uploads should support multipart upload and resumability. Split files into chunks, track uploaded parts with part numbers, byte ranges, ETags, and checksums. Finalize only after all required parts are present and the full-file checksum matches, e.g. `SHA-256(file) == expected_hash`.
Versioning is usually append-only. `files` represents the logical document, `file_versions` represents immutable content snapshots, and `current_version_id` points to the latest committed version. This supports rollback, audit history, collaborative edits, and safe reads during concurrent writes.
Consistency model should be explicit. Metadata operations like create, rename, delete, and permission changes generally require strong consistency. Blob reads can be served from object storage or CDN with cache validation, but clients should resolve file identity and version through strongly consistent metadata.
Sync clients need change tracking, not recursive polling. Maintain a monotonically increasing `change_seq` or per-namespace log of events: create, update, rename, delete, permission change. Clients call `GET /changes?cursor=...`; if the cursor is too old, force a full resync.
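
As one illustration of the multipart upload item in this list, here is a minimal in-memory sketch of tracking parts and verifying the full-file hash before finalizing. The `UploadSession` class and its method names are hypothetical; a real service would keep this state in the metadata store and the bytes in object storage.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class UploadSession:
    """Hypothetical in-memory upload session (illustration only)."""
    expected_sha256: str
    total_parts: int
    parts: dict[int, bytes] = field(default_factory=dict)

    def upload_part(self, part_number: int, data: bytes) -> str:
        """Store one part and return its checksum, playing the role of an ETag."""
        if not 1 <= part_number <= self.total_parts:
            raise ValueError(f"part {part_number} out of range")
        # Re-sending the same part number overwrites it, so retries are safe.
        self.parts[part_number] = data
        return hashlib.sha256(data).hexdigest()

    def missing_parts(self) -> list[int]:
        """Part numbers still to be sent; a resuming client asks for this."""
        return [n for n in range(1, self.total_parts + 1) if n not in self.parts]

    def finalize(self) -> bytes:
        """Assemble parts in order; refuse to commit on gaps or hash mismatch."""
        missing = self.missing_parts()
        if missing:
            raise ValueError(f"missing parts: {missing}")
        ordered = [self.parts[n] for n in range(1, self.total_parts + 1)]
        digest = hashlib.sha256()
        for chunk in ordered:
            digest.update(chunk)
        if digest.hexdigest() != self.expected_sha256:
            raise ValueError("full-file checksum mismatch")
        return b"".join(ordered)
```

A client that loses its connection can call `missing_parts()` after reconnecting and send only what is absent, which is the resumability property the item describes.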

Worked example

For the prompt “Design a production file storage service,” start by clarifying scope: “Are we designing Dropbox-like user storage, enterprise document storage, or a backend service? Do we need folders, rename, permissions, file versioning, and per-directory limits? What scale should I assume for file count, object size, and QPS?” Then declare assumptions: file bytes are large and immutable, metadata must be strongly consistent, and object storage is durable but not transactional with the metadata database.

A strong answer can be organized around four pillars: API design, metadata schema, blob storage/upload flow, and concurrency/failure handling. For APIs, define `CreateFolder`, `StartUpload`, `UploadPart`, `CommitUpload`, `Rename`, `Move`, `ListFolder`, `GetDownloadUrl`, and `Delete`. For schema, separate `nodes` (or `files`/`folders`), `file_versions`, `blobs`, `upload_sessions`, and `idempotency_keys`, with constraints like `UNIQUE(parent_id, normalized_name)`.
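
The schema pillar can be sketched roughly as follows, using SQLite through Python's `sqlite3` module as a stand-in for the transactional metadata store. The column choices, the `open_db` and `add_node` helpers, and the NFC-plus-casefold normalization are all assumptions made for illustration, not a prescribed design.

```python
import sqlite3
import unicodedata

SCHEMA = """
CREATE TABLE nodes (
    id                 INTEGER PRIMARY KEY,
    parent_id          INTEGER REFERENCES nodes(id),  -- NULL only for a root node
    kind               TEXT NOT NULL CHECK (kind IN ('file', 'folder')),
    name               TEXT NOT NULL,
    normalized_name    TEXT NOT NULL,
    current_version_id INTEGER,                       -- latest committed version
    UNIQUE (parent_id, normalized_name)
);
CREATE TABLE blobs (
    id     INTEGER PRIMARY KEY,
    sha256 TEXT NOT NULL,
    size   INTEGER NOT NULL
);
CREATE TABLE file_versions (
    id      INTEGER PRIMARY KEY,
    file_id INTEGER NOT NULL REFERENCES nodes(id),
    blob_id INTEGER NOT NULL REFERENCES blobs(id)
);
CREATE TABLE upload_sessions (
    id              INTEGER PRIMARY KEY,
    file_id         INTEGER NOT NULL REFERENCES nodes(id),
    expected_sha256 TEXT NOT NULL,
    state           TEXT NOT NULL DEFAULT 'open'
);
CREATE TABLE idempotency_keys (
    user_id         TEXT NOT NULL,
    operation       TEXT NOT NULL,
    idempotency_key TEXT NOT NULL,
    response        TEXT NOT NULL,
    PRIMARY KEY (user_id, operation, idempotency_key)
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_node(conn: sqlite3.Connection, parent_id, kind: str, name: str) -> int:
    """Insert a file or folder; the UNIQUE constraint rejects name collisions."""
    # Assumed normalization: Unicode NFC plus casefold for case-insensitivity.
    normalized = unicodedata.normalize("NFC", name).casefold()
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO nodes (parent_id, kind, name, normalized_name) "
            "VALUES (?, ?, ?, ?)",
            (parent_id, kind, name, normalized),
        )
    return cur.lastrowid
```

With this schema, adding `Brief.pdf` and then `brief.PDF` under the same parent raises `sqlite3.IntegrityError`, so the invariant lives in the database rather than in application checks. Note that SQL treats `NULL` parents as distinct, so root uniqueness would need separate handling.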

The key tradeoff to flag is metadata consistency versus storage scalability: keep rename and directory limits in a transactional database, while storing bytes externally in `S3`-style storage and reconciling through background cleanup. When discussing per-directory limits, call out the race condition explicitly: with a limit of 1,000, two concurrent creates can both see 999 children and both insert, leaving 1,001, so the limit must be enforced with a locked counter or conditional write. Close by saying that, with more time, you would add permission inheritance, audit logging, cross-region replication, and disaster recovery testing.
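
A minimal sketch of the conditional-write approach to the directory limit, again using SQLite via `sqlite3` as a stand-in. The `folders` and `children` tables, the `FolderFull` exception, and the `create_child` helper are hypothetical names chosen for this example.

```python
import sqlite3


class FolderFull(Exception):
    """Raised when the conditional counter update matches no row."""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE folders (
            id          INTEGER PRIMARY KEY,
            child_count INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE children (
            id        INTEGER PRIMARY KEY,
            parent_id INTEGER NOT NULL REFERENCES folders(id),
            name      TEXT NOT NULL,
            UNIQUE (parent_id, name)
        );
    """)
    return conn


def create_child(conn: sqlite3.Connection, parent_id: int, name: str,
                 max_children: int) -> int:
    """Reserve a slot and insert the child in one transaction."""
    with conn:  # commit on success, roll back (including the counter) on error
        # The limit check and the increment happen in a single statement,
        # so two writers cannot both pass a stale "count < limit" read.
        cur = conn.execute(
            "UPDATE folders SET child_count = child_count + 1 "
            "WHERE id = ? AND child_count < ?",
            (parent_id, max_children),
        )
        if cur.rowcount == 0:
            # Either the folder is full or it does not exist.
            raise FolderFull(f"folder {parent_id} is full or missing")
        cur = conn.execute(
            "INSERT INTO children (parent_id, name) VALUES (?, ?)",
            (parent_id, name),
        )
        return cur.lastrowid
```

Because the increment and the insert share a transaction, a failed insert (for example a duplicate name) rolls the counter back, so `child_count` does not drift from the real number of children.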

A second angle

For the prompt “Design a Cloud File Storage Service,” the same architecture applies, but the emphasis shifts toward client synchronization, collaboration, and user-visible behavior. You still separate immutable blobs from strongly consistent metadata, but now you should spend more time on `GET /changes`, conflict handling, offline edits, and version history. A rename should be represented as a single metadata event rather than a delete-plus-create, so sync clients preserve identity and avoid re-downloading unchanged bytes. Resumable upload becomes more prominent because cloud clients may upload multi-GB files over unreliable networks. The design should state whether last-writer-wins is acceptable or whether conflicting versions are preserved as separate file versions.
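
The cursor-based change feed behind `GET /changes` might look something like the in-memory sketch below. The `ChangeLog` class, the `retain` window, and the `CursorExpired` exception are assumptions for illustration; a production log would be durable and scoped per namespace.

```python
from dataclasses import dataclass


class CursorExpired(Exception):
    """The cursor predates the retained window; the client must fully resync."""


@dataclass(frozen=True)
class Change:
    seq: int       # monotonically increasing sequence number
    kind: str      # "create", "update", "rename", "delete", "permission"
    file_id: str   # stable identity, unchanged by rename or move
    detail: dict


class ChangeLog:
    def __init__(self, retain: int = 1000):
        self._events: list[Change] = []
        self._next_seq = 1
        self._retain = retain

    def append(self, kind: str, file_id: str, **detail) -> int:
        """Record one event and trim events beyond the retention window."""
        change = Change(self._next_seq, kind, file_id, detail)
        self._next_seq += 1
        self._events.append(change)
        excess = len(self._events) - self._retain
        if excess > 0:
            del self._events[:excess]
        return change.seq

    def changes_since(self, cursor: int, limit: int = 100):
        """Return (events with seq > cursor, next cursor). Cursor 0 means start."""
        oldest = self._events[0].seq if self._events else self._next_seq
        if cursor < oldest - 1:
            # Events between the cursor and the oldest retained one are gone.
            raise CursorExpired(f"cursor {cursor} is older than seq {oldest}")
        page = [c for c in self._events if c.seq > cursor][:limit]
        next_cursor = page[-1].seq if page else cursor
        return page, next_cursor
```

A rename is logged as one `rename` event carrying the same `file_id`, so a sync client updates the local name in place instead of deleting and re-downloading the file.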

Common pitfalls

Pitfall: Treating paths as the source of truth.

A tempting answer is to store files directly under keys like `user_id/folder/doc.pdf` and “rename” by copying objects. That fails for large files, concurrent renames, and version history. A better answer uses stable IDs for logical files and treats paths as mutable metadata.
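
To make "paths as mutable metadata" concrete, here is a small sketch of a move/rename that only touches metadata rows, including a cycle check. It uses SQLite via `sqlite3`; the `nodes` table layout, the `blob_key` column, and the `move` and `path_of` helpers are hypothetical.

```python
import sqlite3


class CycleError(Exception):
    """Raised when a folder would be moved into itself or a descendant."""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE nodes (
            id        INTEGER PRIMARY KEY,
            parent_id INTEGER REFERENCES nodes(id),  -- NULL for the root
            name      TEXT NOT NULL,
            blob_key  TEXT,                          -- opaque key, never a path
            UNIQUE (parent_id, name)
        )""")
    return conn


def move(conn: sqlite3.Connection, node_id: int, new_parent_id: int,
         new_name: str) -> None:
    """Rename and/or move a node as one metadata transaction."""
    with conn:
        # Walk up from the destination; meeting node_id means a cycle.
        # A multi-writer database would also need row locks or serializable
        # isolation so the ancestor chain cannot change during this walk.
        ancestor = new_parent_id
        while ancestor is not None:
            if ancestor == node_id:
                raise CycleError(f"cannot move {node_id} under itself")
            row = conn.execute(
                "SELECT parent_id FROM nodes WHERE id = ?", (ancestor,)
            ).fetchone()
            if row is None:
                raise LookupError(f"destination {ancestor} does not exist")
            ancestor = row[0]
        cur = conn.execute(
            "UPDATE nodes SET parent_id = ?, name = ? WHERE id = ?",
            (new_parent_id, new_name, node_id),
        )
        if cur.rowcount == 0:
            raise LookupError(f"node {node_id} does not exist")


def path_of(conn: sqlite3.Connection, node_id: int) -> str:
    """Derive the display path by walking parents; the root adds no segment."""
    parts = []
    while True:
        parent_id, name = conn.execute(
            "SELECT parent_id, name FROM nodes WHERE id = ?", (node_id,)
        ).fetchone()
        if parent_id is None:
            break
        parts.append(name)
        node_id = parent_id
    return "/" + "/".join(reversed(parts))
```

The path is computed from the tree on demand, while `blob_key` stays fixed across moves, so renaming a multi-GB file costs one row update and no object copy.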

Pitfall: Saying “use S3 and Postgres” without defining invariants.

The interviewer is not testing whether you know popular services; they are testing whether you can preserve correctness under retries, partial failures, and concurrent operations. State invariants such as “no duplicate child names under the same parent,” “a committed version always references an existing blob,” and “a folder’s child count never exceeds its configured limit.”
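
Correctness under retries is often handled with idempotency keys. The sketch below shows the replay-or-reject behavior in memory; the `IdempotencyStore` class, its `run` method, and the JSON-based parameter fingerprint are assumptions for illustration. In a real service the record and the operation's side effect would need to be written in the same transaction, and concurrent requests with the same key would need locking, neither of which this sketch attempts.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Same key reused with different parameters."""


class IdempotencyStore:
    def __init__(self):
        # (user_id, operation, idempotency_key) -> (param fingerprint, response)
        self._records: dict[tuple[str, str, str], tuple[str, dict]] = {}

    def run(self, user_id: str, operation: str, key: str,
            params: dict, handler) -> dict:
        """Execute handler once per key; replay the stored response on retry."""
        # Fingerprint the parameters so a reused key with a different
        # request body is detected instead of silently replayed.
        fingerprint = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode("utf-8")
        ).hexdigest()
        record_key = (user_id, operation, key)
        if record_key in self._records:
            stored_fingerprint, response = self._records[record_key]
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict(f"key {key!r} reused with new params")
            return response
        response = handler(params)
        self._records[record_key] = (fingerprint, response)
        return response
```

A client that times out on `CreateFolder` can resend the request with the same key and receive the original response, rather than creating a second folder or hitting the duplicate-name constraint.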

Pitfall: Over-indexing on scalability while ignoring atomicity.

Some candidates jump immediately to sharding, CDNs, and queues, then miss the hard correctness problem in rename, move, and create. Scalability matters, but for this system the sharp edge is transactional metadata. First design one correct shard or namespace, then explain how to partition by `tenant_id`, `drive_id`, or root folder once the invariants are clear.

Connections

Interviewers can pivot from this into distributed transactions, object storage internals, sync protocols, access control, or multi-region replication. Be ready to compare strong consistency versus eventual consistency, discuss cache invalidation for download URLs, and explain how audit logs or event streams support search indexing and client sync without becoming the source of truth.
