Design a scalable, reliable system

This is a System Design interview question from Anthropic for Software Engineer roles.

Question

Context

Design a scalable, highly reliable consumer service where users upload, store, view, and share photos/videos from mobile and web clients. The product supports private storage, shared folders/links, previews (thumbnails, transcodes), search, and versioning. The user base is global with diurnal traffic peaks. Assume a freemium model with both private and publicly shared links.

Task

Specify and justify a production-ready design across the following:

(a) Functional and Non-Functional Requirements

Functional: user onboarding, upload (single and multipart), download/stream, list/browse, share (link- and user-based), versioning, delete/restore, search, thumbnails/transcoding, quotas/billing, audit logs.
Non-Functional: availability, latency SLOs, throughput targets, consistency model (e.g., read-after-write for own uploads), durability, privacy/compliance.
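
An availability target is easier to reason about once it is converted into a downtime budget. The sketch below shows that arithmetic; the function name and the 30-day window are assumptions made for illustration, not values stated in the question.

```python
def allowed_downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    availability is a fraction (for example 0.999), and window_days is the
    length of the measurement window (assumed to be 30 days here).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    minutes_in_window = window_days * 24 * 60
    return minutes_in_window * (1.0 - availability)


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.4%} -> {allowed_downtime_minutes(target):.2f} min per 30 days")
```

Over a 30-day window, a 99.9% target leaves roughly 43.2 minutes of downtime, and 99.99% leaves roughly 4.3 minutes.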

(b) High-Level Architecture

Cover clients, API gateway, services, data stores, async processing, messaging/streaming, and CDN/edge. Indicate where the stateless/stateful boundaries lie.

(c) Core APIs and Data Models

Define key REST/gRPC APIs (upload init, part upload, complete, get/download, list, share, delete, restore, search).
Sketch essential data models (User, Object, ObjectVersion, Folder, ACL/Share, UploadSession, AuditEvent).
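
One possible starting point for the data models is sketched below. Every field name here is an assumption made for illustration; the question names only the entities (Object, ObjectVersion, UploadSession, and so on), not their attributes. The sketch covers three of them and shows how a multipart upload session might track received parts.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class ObjectVersion:
    version_id: str
    object_id: str
    blob_key: str            # hypothetical pointer to immutable bytes in the blob store
    size_bytes: int
    checksum_sha256: str
    created_at: datetime = field(default_factory=_now)


@dataclass
class StoredObject:
    object_id: str
    owner_id: str
    name: str
    folder_id: Optional[str] = None
    current_version_id: Optional[str] = None
    deleted_at: Optional[datetime] = None   # soft delete, so restore stays possible


@dataclass
class UploadSession:
    session_id: str
    object_id: str
    part_size_bytes: int
    # part number -> checksum of that part
    received_parts: Dict[int, str] = field(default_factory=dict)

    def record_part(self, part_number: int, checksum: str) -> None:
        # Re-sending the same part with the same checksum leaves state unchanged,
        # which keeps client retries safe.
        self.received_parts[part_number] = checksum

    def is_complete(self, expected_parts: int) -> bool:
        # Complete only when every part number 1..expected_parts has arrived.
        return set(self.received_parts) == set(range(1, expected_parts + 1))
```

Keeping versions immutable and pointing the object at a current version is one common way to support versioning and restore; other layouts are equally defensible.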

(d) Partitioning, Replication, Consistency

How to shard metadata and objects; replication across AZs/regions; consistency choices for metadata vs object blobs.
Transactions across services, idempotency for retries, schema/version evolution.
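
As a minimal sketch of two of the ideas above, the code below shows hash-based shard selection for metadata and an idempotency-key wrapper for retried requests. The names `shard_for` and `IdempotentExecutor` are hypothetical, the choice of owner ID as the shard key is an assumption, and the in-memory dictionary stands in for what would need to be a durable store in production.

```python
import hashlib
from typing import Callable, Dict


def shard_for(owner_id: str, num_shards: int) -> int:
    """Map an owner to a metadata shard using a stable hash.

    Python's built-in hash() is randomized per process, so a cryptographic
    digest is used to get the same answer on every host.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(owner_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class IdempotentExecutor:
    """Run an operation at most once per idempotency key (in-memory sketch)."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}

    def run(self, idempotency_key: str, operation: Callable[[], object]) -> object:
        # A retry carrying the same key gets the stored result back
        # instead of repeating the side effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = operation()
        self._results[idempotency_key] = result
        return result
```

Plain modulo sharding remaps most keys when the shard count changes; consistent hashing or a directory-based mapping is a common alternative when resharding is expected.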

(e) Caching Strategy

Client caches (ETag/If-None-Match), edge/CDN (signed URLs, TTLs, invalidation), server-side caches (Redis) for hot metadata.
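
The ETag/If-None-Match exchange can be sketched as a small server-side function. This is an illustration under stated assumptions: the ETag is derived from a SHA-256 of the content (real services often use a version ID instead), and the header parsing is simplified to comma-separated values.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(content: bytes) -> str:
    # Assumption: a content hash serves as the entity tag.
    return '"' + hashlib.sha256(content).hexdigest() + '"'


def _strip_weak(tag: str) -> str:
    # If-None-Match uses weak comparison, so a leading W/ is ignored.
    return tag[2:] if tag.startswith("W/") else tag


def _matches(if_none_match: str, etag: str) -> bool:
    value = if_none_match.strip()
    if value == "*":
        return True
    # Simplified parsing: assumes tags themselves contain no commas.
    candidates = {_strip_weak(c.strip()) for c in value.split(",")}
    return _strip_weak(etag) in candidates


def conditional_get(
    content: bytes, if_none_match: Optional[str] = None
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET with an optional If-None-Match."""
    etag = make_etag(content)
    headers = {"ETag": etag}
    if if_none_match is not None and _matches(if_none_match, etag):
        return 304, headers, b""   # client copy is current; send no body
    return 200, headers, content
```

A client would store the `ETag` from the first response and send it back as `If-None-Match` on the next request, receiving a 304 with an empty body when nothing has changed.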

(f) Load Balancing, Routing, Autoscaling

Global traffic routing, L7 load balancing, service discovery, scaling policies for stateless APIs and workers.

(g) Failure Handling, Backpressure, Disaster Recovery

Timeouts, retries with exponential backoff and jitter, circuit breakers, queue-based backpressure.
DR plan with RPO/RTO and multi-region strategy (active-active or active-passive).
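
Retries with exponential backoff and jitter can be sketched as follows. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and the capped exponential bound; the default base, cap, and attempt count are illustrative assumptions, not recommended values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def full_jitter_delay(
    attempt: int,
    base: float = 0.1,
    cap: float = 10.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    return rng() * min(cap, base * (2 ** attempt))


def retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception up to `max_attempts` times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise              # out of attempts: surface the last error
            sleep(full_jitter_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

In a real client, only errors known to be transient should be retried, and the retried call should be idempotent; catching every exception here is a simplification.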

(h) Observability, Rate Limiting, Security

Metrics/logs/traces, SLO monitoring, alerts.
Rate limiting (per-user/IP), abuse detection.
Security: authn/authz, encryption in transit/at rest, KMS, key rotation, secure sharing.
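
Per-user or per-IP rate limiting is often built on a token bucket. The sketch below is a single-process illustration with hypothetical names; a distributed deployment would typically keep the bucket state in a shared store, which is not shown here.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills at `rate_per_sec`, allows bursts up to `capacity`."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full so initial bursts succeed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill according to elapsed time, never exceeding capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_sec)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

One bucket per user ID or client IP, looked up from a dictionary or cache, gives the per-user/IP behavior the task describes.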

(i) Capacity Planning, Cost, Scaling Roadmap

Estimate storage, QPS, bandwidth. Provide formulas, back-of-envelope numbers, and cost trade-offs (hot vs cold tiers, replication vs erasure coding).
Phased roadmap from MVP to multi-region scale.
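
The back-of-envelope formulas can be written down as a small calculator. All inputs are parameters the candidate must assume and justify; the question does not supply numbers, and the function names below are hypothetical. The storage overhead factor is 3.0 for three-way replication and (k + m) / k for erasure coding with k data and m parity shards.

```python
from typing import Dict

SECONDS_PER_DAY = 86_400


def erasure_overhead(data_shards: int, parity_shards: int) -> float:
    """Storage multiplier for a k+m erasure code, e.g. 10+4 -> 1.4."""
    return (data_shards + parity_shards) / data_shards


def estimate(
    daily_active_users: int,
    uploads_per_user_per_day: float,
    avg_object_bytes: int,
    peak_to_avg_ratio: float,
    storage_overhead: float,
) -> Dict[str, float]:
    """Derive daily storage growth, upload QPS, and ingest bandwidth."""
    uploads_per_day = daily_active_users * uploads_per_user_per_day
    raw_bytes_per_day = uploads_per_day * avg_object_bytes
    avg_upload_qps = uploads_per_day / SECONDS_PER_DAY
    return {
        "raw_bytes_per_day": raw_bytes_per_day,
        "stored_bytes_per_day": raw_bytes_per_day * storage_overhead,
        "avg_upload_qps": avg_upload_qps,
        # Diurnal peaks are modeled as a simple multiple of the average.
        "peak_upload_qps": avg_upload_qps * peak_to_avg_ratio,
        "avg_ingest_bits_per_sec": raw_bytes_per_day * 8 / SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    # Purely illustrative inputs: 1M DAU, 2 uploads each, 5 MB per object.
    for key, value in estimate(1_000_000, 2, 5_000_000, 3.0, erasure_overhead(10, 4)).items():
        print(f"{key}: {value:,.2f}")
```

Comparing `storage_overhead=3.0` with `erasure_overhead(10, 4)` makes the replication versus erasure coding cost trade-off concrete, at the price of higher repair and read complexity for the erasure-coded tier.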

(j) Risks and Mitigations

Identify key bottlenecks and failure modes; propose mitigations.
