System Design: Global Photo/Video File Storage and Sharing ("CloudDrive")
Context
Design a scalable, highly reliable consumer service where users upload, store, view, and share photos/videos from mobile and web clients. The product supports private storage, shared folders/links, previews (thumbnails, transcodes), search, and versioning. The user base is global with diurnal traffic peaks. Assume a freemium model with both private and publicly shared links.
Task
Specify and justify a production-ready design across the following:
(a) Functional and Non-Functional Requirements
- Functional: user onboarding, upload (single and multipart), download/stream, list/browse, share (link- and user-based), versioning, delete/restore, search, thumbnails/transcoding, quotas/billing, audit logs.
- Non-Functional: availability, latency SLOs, throughput targets, consistency model (e.g., read-after-write for own uploads), durability, privacy/compliance.
(b) High-Level Architecture
- Clients, API gateway, services, data stores, async processing, messaging/streaming, CDN/edge. Include where stateless/stateful boundaries lie.
(c) Core APIs and Data Models
- Define key REST/gRPC APIs (upload init, part upload, complete, get/download, list, share, delete, restore, search).
- Sketch essential data models (User, Object, ObjectVersion, Folder, ACL/Share, UploadSession, AuditEvent).
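To illustrate the level of detail expected for the upload flow (init, part upload, complete) and the UploadSession model, here is a minimal in-memory sketch. The class shape, field names, and method names are hypothetical and chosen only for illustration; a real design would persist session state and store parts in blob storage rather than in process memory.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UploadSession:
    """Hypothetical in-memory model of one multipart upload session."""
    session_id: str
    object_key: str
    parts: Dict[int, bytes] = field(default_factory=dict)  # part number -> data
    completed: bool = False

    def put_part(self, part_number: int, data: bytes) -> str:
        """Store one part and return its SHA-256 hex digest.

        Re-sending the same part number overwrites the earlier copy,
        which keeps client retries of a part upload safe.
        """
        if self.completed:
            raise ValueError("session already completed")
        if part_number < 1:
            raise ValueError("part numbers start at 1")
        self.parts[part_number] = data
        return hashlib.sha256(data).hexdigest()

    def complete(self, expected_parts: int) -> bytes:
        """Assemble parts 1..expected_parts in order; fail if any is missing."""
        missing = [n for n in range(1, expected_parts + 1) if n not in self.parts]
        if missing:
            raise ValueError(f"missing parts: {missing}")
        self.completed = True
        return b"".join(self.parts[n] for n in range(1, expected_parts + 1))
```

Parts may arrive in any order; only the complete step enforces that the set is contiguous.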
(d) Partitioning, Replication, Consistency
- How to shard metadata and objects; replication across AZs/regions; consistency choices for metadata vs object blobs.
- Transactions across services, idempotency for retries, schema/version evolution.
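As one way to picture the idempotency requirement for retries, the sketch below caches the result of an operation under a client-supplied idempotency key so that a retried request does not repeat the side effect. `IdempotentExecutor` and its method are hypothetical names. The in-process dictionary is an assumption made for brevity: a production design would need a durable, shared store and would have to handle concurrent requests carrying the same key.

```python
from typing import Callable, Dict


class IdempotentExecutor:
    """Hypothetical helper: run an operation at most once per idempotency key."""

    def __init__(self) -> None:
        # Assumption: in-memory store; single-threaded use only.
        self._results: Dict[str, object] = {}

    def execute(self, key: str, operation: Callable[[], object]) -> object:
        # A retry with a known key returns the stored result
        # instead of running the operation again.
        if key in self._results:
            return self._results[key]
        result = operation()
        self._results[key] = result
        return result
```

If the operation raises, nothing is stored, so a later retry with the same key runs it again.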
(e) Caching Strategy
- Client caches (ETag/If-None-Match), edge/CDN (signed URLs, TTLs, invalidation), server-side caches (Redis) for hot metadata.
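The client-cache item above relies on HTTP conditional requests. The following sketch shows the core check under simplifying assumptions: the ETag is derived from a SHA-256 hash of the body, and the `If-None-Match` header carries a single strong ETag (real headers may carry a list of tags or `*`, which this does not handle). The function names are hypothetical.

```python
import hashlib
from typing import Optional, Tuple


def make_etag(body: bytes) -> str:
    """Build a quoted strong ETag from the content hash (an assumed scheme)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def conditional_get(body: bytes, if_none_match: Optional[str]) -> Tuple[int, bytes, str]:
    """Return (status, payload, etag) for a GET with an optional If-None-Match."""
    etag = make_etag(body)
    if if_none_match is not None and if_none_match == etag:
        # The client copy is current: answer 304 with an empty payload.
        return 304, b"", etag
    return 200, body, etag
```

A client would store the ETag from the first 200 response and send it back on the next request, paying for the body only when the content has changed.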
(f) Load Balancing, Routing, Autoscaling
- Global traffic routing, L7 load balancing, service discovery, scaling policies for stateless APIs and workers.
(g) Failure Handling, Backpressure, Disaster Recovery
- Timeouts, retries with exponential backoff and jitter, circuit breakers, queue-based backpressure.
- DR plan with RPO/RTO and multi-region strategy (active-active or active-passive).
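Retries with exponential backoff and jitter, named in the list above, can be sketched as follows. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and a capped exponential bound. The function names and the default values for `base`, `cap`, and `max_attempts` are illustrative assumptions, and the sketch retries on any exception, whereas a real client would retry only errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Re-raise the last failure instead of sleeping again.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that tests can record delays instead of waiting.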
(h) Observability, Rate Limiting, Security
- Metrics/logs/traces, SLO monitoring, alerts.
- Rate limiting (per-user/IP), abuse detection.
- Security: authn/authz, encryption in transit/at rest, KMS, key rotation, secure sharing.
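For the per-user/IP rate limiting item above, a token bucket is one common choice (the prompt itself does not mandate an algorithm). The sketch below keeps one bucket in process memory; the class name and parameters are hypothetical, and a distributed deployment would need shared state, for example one bucket per user key in a shared cache.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical single-process token bucket.

    rate: tokens added per second; capacity: maximum burst size.
    """

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A service would typically keep one bucket per user ID or client IP and reject requests with HTTP 429 when `allow` returns False.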
(i) Capacity Planning, Cost, Scaling Roadmap
- Estimate storage, QPS, bandwidth. Provide formulas, back-of-envelope numbers, and cost trade-offs (hot vs cold tiers, replication vs erasure coding).
- Phased roadmap from MVP to multi-region scale.
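The back-of-envelope formulas requested above could take a shape like the following. Every input is a placeholder to be replaced with the candidate's own assumptions; none of the parameters or example values come from the prompt. The sketch also ignores CDN cache hits, so the egress figure is an upper bound on origin traffic, and it treats every read as a full-object download.

```python
from typing import Dict

SECONDS_PER_DAY = 86_400


def erasure_overhead(data_shards: int, parity_shards: int) -> float:
    """Storage multiplier for a (data + parity) erasure code, e.g. 10 + 4 -> 1.4."""
    return (data_shards + parity_shards) / data_shards


def estimate_capacity(
    daily_active_users: int,
    uploads_per_user_per_day: float,
    avg_object_bytes: float,
    reads_per_upload: float,
    peak_factor: float,
    storage_overhead: float,
) -> Dict[str, float]:
    """Derive peak QPS, daily stored bytes, and peak bandwidth from assumed inputs."""
    uploads_per_day = daily_active_users * uploads_per_user_per_day
    avg_write_qps = uploads_per_day / SECONDS_PER_DAY
    avg_read_qps = avg_write_qps * reads_per_upload
    logical_bytes_per_day = uploads_per_day * avg_object_bytes
    return {
        "peak_write_qps": avg_write_qps * peak_factor,
        "peak_read_qps": avg_read_qps * peak_factor,
        # Physical bytes after replication or erasure-coding overhead.
        "stored_bytes_per_day": logical_bytes_per_day * storage_overhead,
        # Bits per second at peak, before any CDN offload.
        "peak_ingress_bps": avg_write_qps * peak_factor * avg_object_bytes * 8,
        "peak_egress_bps": avg_read_qps * peak_factor * avg_object_bytes * 8,
    }
```

Passing `storage_overhead=3.0` models three-way replication, while `storage_overhead=erasure_overhead(10, 4)` models a 10 + 4 erasure code, which makes the storage side of that trade-off directly comparable.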
(j) Risks and Mitigations
- Identify key bottlenecks and failure modes; propose mitigations.