Adobe Creative Cloud Real-Time Collaboration And Offline Sync

What's being tested

Interviewers probe your ability to design a low-latency, scalable collaboration system that also supports robust offline sync and recovery. They expect concrete distributed-systems tradeoffs: replication/consistency choices, conflict-resolution algorithms, client sync protocols, storage/compaction strategies, and how the design meets latency, bandwidth, and operational constraints at Adobe scale.

Core knowledge

CRDT (Conflict-free Replicated Data Types) vs Operational Transformation (OT): CRDTs guarantee eventual convergence without coordination, which suits offline-first clients, at the cost of extra per-element metadata. OT usually relies on a central server to order and transform operations for intention preservation; it carries less metadata but makes merging long offline branches harder.
Op-based vs state-based (delta) replication: op-based replication sends individual operations, which minimizes bandwidth for small ops but requires reliable, causally ordered delivery (or idempotent ops). State-based and delta-state CRDTs send compacted deltas and make late joiners easier to catch up. Per-document memory and metadata grow as O(#ops) unless you snapshot and compact.
Causality & metadata: use vector clocks or compact version vectors to detect concurrent operations; Lamport timestamps give a total order but cannot tell you whether two events were concurrent. A vector clock carries one entry per replica, so metadata is O(#replicas); dotted version vectors or server-summarized clocks keep it bounded when many short-lived clients write.
Client architecture (local-first store + change log): store edits locally in IndexedDB/SQLite, expose optimistic updates, and maintain a durable operation log (changepack) with checkpointing (last-acked sequence number) for efficient resync.
Sync protocol: clients send a changepack and a checkpoint (e.g., last-seen-op-id); the server replies with the missing ops or a snapshot diff. Steady-state bandwidth ≈ ops/sec × avg_op_size; to estimate the payload of a single changepack of $n$ ops, use $B = \sum_{i=1}^{n} s_i$, where $s_i$ is the serialized size of op $i$.
Real-time transport & scale: use WebSocket/gRPC for low latency; scale to hundreds of millions of connections via connection gateways, sticky routing, and per-document channels. For fan-out, prefer lightweight ephemeral pub/sub (e.g., Redis Pub/Sub or a custom gossip layer) over routing every op through a durable log such as Kafka when you need sub-100ms delivery.
Persistence strategy: event-sourcing (append-only op-store) + periodic snapshots for fast recovery. Compaction (merge ops into snapshot and GC tombstones) reduces storage and read latency but requires consistent snapshotting.
Large assets and partial replication: large binary files (images, PSD layers) should be stored in object storage (e.g., S3) with metadata/annotations as CRDTs; sync metadata only and fetch blobs on demand to avoid full-file transfer during edits.
Conflict resolution patterns: merge policies (last-writer-wins, CRDT merge, application-specific merge functions), tombstone handling, and undo/redo maintenance. Choose policies based on user expectations (intention preservation vs convergence).
Operational concerns: idempotency (idempotency-key), retry/backoff with jitter, rate-limiting, p99 latency SLOs, and observability (ops/sec, op-lag, divergence rate). For long offline windows, expect large changepacks and plan incremental checkpoints.
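
To make the causality item above concrete, here is a minimal Python sketch of vector-clock comparison. The dict-based clock representation and the names `compare` and `merge` are illustrative assumptions, not the API of any particular library.

```python
from typing import Dict

# Assumed representation: replica id -> number of events seen from that replica.
VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify clock a relative to b: 'equal', 'before', 'after', or 'concurrent'."""
    replicas = set(a) | set(b)
    # Missing entries count as 0 (no events seen from that replica yet).
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened-before b
    if b_le_a:
        return "after"       # b happened-before a
    return "concurrent"      # neither dominates: a real conflict to merge


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Pointwise maximum: the clock of a replica that has seen both histories."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}
```

For instance, `compare({"a": 2}, {"a": 1, "b": 1})` returns `"concurrent"`, which is exactly the case a Lamport timestamp alone cannot detect.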

Worked example — "Design a real-time collaboration system for Creative Cloud with offline sync"

First 30s: clarify scale (concurrent editors per document, total docs), offline window (minutes, hours, days), strong vs eventual consistency needs, and what parts must merge automatically (text/annotations) vs require manual conflict resolution (binary image edits). Skeleton answer pillars: (1) Client local-first model with durable change log in IndexedDB and optimistic UI; (2) Sync protocol using op-based CRDTs with checkpoints and causal metadata; (3) Real-time layer with WebSocket pub/sub for ops and presence; (4) Server storage combining event store + periodic snapshots + object store for blobs. Key tradeoff: pick CRDT for offline convergence and straightforward client merging vs OT for potentially smaller metadata but more server complexity; explicitly state metadata growth and plan compaction/snapshots to bound storage. Close by listing tests (automated property tests for convergence), metrics (op-applied-lag, divergence incidents), and follow-ups: "If I had more time I'd prototype op sizes and run network simulation for 24-hour offline resyncs."
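
The checkpoint-based sync in pillar (2) can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions: the names `Op`, `DocumentLog`, and `push_pull` are hypothetical, the checkpoint is taken to be the number of server log entries the client has already received, and op ids of the form (client_id, client_seq) serve as idempotency keys.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Op:
    client_id: str
    client_seq: int   # per-client counter; (client_id, client_seq) is the op id
    payload: str

    @property
    def op_id(self) -> Tuple[str, int]:
        return (self.client_id, self.client_seq)


@dataclass
class DocumentLog:
    """Hypothetical server-side log for one document."""
    ops: List[Op] = field(default_factory=list)
    seen: Set[Tuple[str, int]] = field(default_factory=set)

    def push_pull(self, client_id: str, changepack: List[Op],
                  checkpoint: int) -> Tuple[List[Op], int]:
        # Append only ops we have not seen, so a retried changepack is harmless.
        for op in changepack:
            if op.op_id not in self.seen:
                self.seen.add(op.op_id)
                self.ops.append(op)
        # Reply with everything after the client's checkpoint, minus its own ops.
        missing = [op for op in self.ops[checkpoint:] if op.client_id != client_id]
        return missing, len(self.ops)
```

A client that was offline simply sends its whole backlog as one changepack with its old checkpoint; the reply carries whatever it missed plus the new checkpoint to persist locally.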

A second angle — "Support collaborative annotations and offline sync for very large PSD files"

Same core concept but different constraints: large binaries make op-granularity on pixels impractical. Use chunked file storage with immutable blobs in S3, and surface a CRDT-managed metadata layer for annotations, layer ordering, and selection. For heavy edits (e.g., filter apply), use server-side transforms with a short-lived lease or optimistic locking to avoid expensive merges. Offline clients sync annotation ops cheaply and fetch or upload updated blobs asynchronously. Explicitly trade immediate local preview (generate low-res proxies client-side) against bandwidth and latency.
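
A minimal sketch of the chunked, immutable-blob idea, assuming fixed-size chunks addressed by SHA-256 digest and a plain dict standing in for the object store. The helper names are hypothetical, and a real system might prefer content-defined chunking so that insertions do not shift every later chunk.

```python
import hashlib
from typing import Dict, List


def split_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut the file into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def manifest(data: bytes, chunk_size: int) -> List[str]:
    """Ordered list of chunk digests; this small list is what gets synced."""
    return [hashlib.sha256(c).hexdigest() for c in split_chunks(data, chunk_size)]


def upload_missing(data: bytes, chunk_size: int, store: Dict[str, bytes]) -> int:
    """Store only chunks the object store lacks; return how many were uploaded."""
    uploaded = 0
    for chunk in split_chunks(data, chunk_size):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:      # immutable blobs: same digest, same bytes
            store[digest] = chunk
            uploaded += 1
    return uploaded


def reassemble(digests: List[str], store: Dict[str, bytes]) -> bytes:
    """Rebuild a file version from its manifest by fetching chunks on demand."""
    return b"".join(store[d] for d in digests)
```

With this layout, an edit that touches one region of a large file uploads only the changed chunks, while the manifest travels alongside the annotation metadata.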

Common pitfalls

Pitfall: Designing for strict strong consistency by default. Strong consistency across global clients requires synchronous coordination and kills the offline experience; articulate when you'll accept eventual convergence versus when you require locks or transactions.

Pitfall: Ignoring metadata growth. Naively storing every operation's metadata leads to unbounded storage and slow sync; propose snapshotting, tombstone compaction, and periodic delta checkpoints.
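
One way to bound that growth, sketched in Python under simplifying assumptions: a key-value document, a single totally ordered server log, and compaction only up to the lowest sequence number every replica has acknowledged. The class `CompactingLog` and its methods are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# (seq, key, value); value None marks a delete (tombstone).
LogEntry = Tuple[int, str, Optional[str]]


@dataclass
class CompactingLog:
    snapshot: Dict[str, str] = field(default_factory=dict)
    snapshot_seq: int = 0                      # last seq folded into the snapshot
    tail: List[LogEntry] = field(default_factory=list)

    def append(self, key: str, value: Optional[str]) -> int:
        seq = self.snapshot_seq + len(self.tail) + 1
        self.tail.append((seq, key, value))
        return seq

    def compact(self, min_acked_seq: int) -> None:
        """Fold ops every replica has acked into the snapshot and drop them."""
        keep: List[LogEntry] = []
        for seq, key, value in self.tail:
            if seq <= min_acked_seq:
                if value is None:
                    self.snapshot.pop(key, None)   # tombstone is safe to GC now
                else:
                    self.snapshot[key] = value
                self.snapshot_seq = seq
            else:
                keep.append((seq, key, value))
        self.tail = keep

    def materialize(self) -> Dict[str, str]:
        """Current document state: snapshot plus the uncompacted tail."""
        state = dict(self.snapshot)
        for _, key, value in self.tail:
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
        return state
```

Compacting only below the minimum acknowledged sequence is what makes tombstone removal safe here: no replica can still send an op that depends on the deleted entry.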

Pitfall: Not defining clear conflict semantics. Saying "we merge conflicts" is insufficient — give concrete merge policies for text, layers, annotations, and large binaries, and explain user-visible outcomes (automatic merge vs conflict resolution UI).

Connections

Interviewers may pivot to related areas: designing presence and cursor scalability, consistency models and consensus (Raft/Paxos) where strong coordination is needed, or client reliability and offline UX tradeoffs. Be prepared to discuss testing strategies (property testing, chaos/network partition simulations).
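
As a small illustration of property testing for convergence, the Python sketch below checks that a last-writer-wins register reaches the same state under shuffled and duplicated delivery. It assumes each write carries a unique (timestamp, replica id) pair, which is used as the tie-breaking order; the function names are hypothetical, and a real suite would more likely use a property-testing library than a hand-rolled loop.

```python
import random
from typing import List, Optional, Tuple

# Assumed write shape: (lamport_timestamp, replica_id, value).
Write = Tuple[int, str, str]


def merge(current: Optional[Write], incoming: Write) -> Write:
    """Last-writer-wins: keep the write with the larger (timestamp, replica_id)."""
    if current is None or incoming[:2] > current[:2]:
        return incoming
    return current


def replay(writes: List[Write]) -> Optional[Write]:
    """Apply writes in the given delivery order and return the final state."""
    state: Optional[Write] = None
    for w in writes:
        state = merge(state, w)
    return state


def check_convergence(writes: List[Write], trials: int = 200, seed: int = 0) -> bool:
    """Property: any reordering with duplicates yields the same final state."""
    rng = random.Random(seed)
    expected = replay(writes)
    for _ in range(trials):
        # Simulate an unreliable network: duplicate some writes, shuffle all.
        delivery = writes + rng.sample(writes, k=len(writes) // 2)
        rng.shuffle(delivery)
        if replay(delivery) != expected:
            return False
    return True
```

The same harness shape extends to richer types: swap `merge` for the CRDT under test and generate random operation histories instead of a fixed list.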