Adobe Document Cloud real-time collaboration and offline sync

What's being tested

The interviewer is probing your ability to design a scalable, correct, and resilient real-time collaboration system with robust offline sync for document workloads (annotations, text, form fields). Expect to discuss tradeoffs between strong consistency and availability, choose a synchronization algorithm (Operational Transformation vs CRDTs), and specify networking, storage, and garbage-collection approaches that keep latency low at scale. At Adobe, this maps to supporting many concurrent editors per document, low p99 latency for UI updates, and safe merging when clients reconnect after long offline periods.

Core knowledge

Real-time transports: Use `WebSocket` (or `WebRTC` data channels for peer-to-peer paths) for low-latency bidirectional updates; fall back to long-polling where persistent connections are blocked or unreliable; authenticate with short-lived tokens and renew them transparently.
Conflict-resolution families: Know Operational Transformation (OT) vs Conflict-free Replicated Data Types (CRDTs); OT as usually deployed relies on a central server to order and transform operations, while CRDTs converge without central ordering but often carry more per-element metadata.
CRDT variants: Distinguish state-based vs operation-based vs delta-based CRDTs; operation-based needs reliable causal delivery; delta-CRDTs reduce bandwidth by sending compact deltas.
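Sequence CRDTs for text: Use a sequence CRDT such as RGA or LSEQ to represent character/element order; RGA keeps tombstones for deleted elements and needs periodic compaction to reclaim space, while LSEQ-style schemes instead pay with position identifiers that can grow over time.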
Causality tracking: Implement vector clocks or version vectors to capture causal relationships; vector clock size is O(R) for R replicas, which limits applicability if R is large.
Log and patch model: Store an append-only operation log per document for auditability and replay; support compaction to a snapshot + offset to bound recovery cost.
Delta-sync & checkpoints: Send minimal deltas and support server checkpoints; clients can request state since a known sequence number to resync (three-way sync: base, local, remote).
Offline queueing & retries: Persist pending operations locally (e.g., `IndexedDB` on web, mobile SQLite), assign local Lamport timestamps or causal metadata, and retry with idempotency keys to avoid duplicates.
Consistency & latency tradeoffs: For per-keystroke updates prefer eventual consistency with commutative ops for low p99; for critical operations (permission changes, signing) use linearizable server-confirmed transactions.
Scalability & channeling: Shard by document-id into per-document channels to limit fan-out; use sticky sessions or per-document routing so that every connection for a document reaches the same node and transform/CRDT application order stays consistent.
Garbage collection: GC tombstones using coordinated compaction or per-document quorum checkpointing; maintain safety windows (based on last-known client vector clocks) before purging.
Security & authorization: Enforce per-operation server-side ACL checks; never trust client-side merge results for access-affecting changes.
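
The causality-tracking item above can be made concrete with a minimal version-vector sketch. The names `VersionVector`, `compare`, and `merge` are illustrative choices, not part of any particular library, and missing entries are assumed to mean "nothing seen from that replica yet".

```python
from typing import Dict

# Replica id -> highest counter seen from that replica (assumed representation).
VersionVector = Dict[str, int]


def compare(a: VersionVector, b: VersionVector) -> str:
    """Return 'equal', 'before', 'after', or 'concurrent' for a relative to b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: the edits need a merge rule


def merge(a: VersionVector, b: VersionVector) -> VersionVector:
    """Pointwise maximum: the smallest vector that dominates both inputs."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}
```

Each vector holds one entry per replica, which is the O(R) cost noted above; a "concurrent" result is the case where the conflict-resolution strategy has to decide the outcome.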

Worked example

Problem framing (first 30s): ask which document primitives (text, annotations, images), max concurrent editors, offline window length, and expected p99 latency. Clarify whether the UI requires real-time character-level merging or coarse-grained ops (e.g., annotation add/remove).

Skeleton of a strong answer:

Choose a data model: sequence CRDT for character/ordered elements; map CRDT for annotations/metadata.
Transport & session: `WebSocket` for live edits, fall back to HTTP sync; per-document channel brokered by a routing tier.
Persistence & recovery: append-only op-log in a durable store, periodic snapshots for fast join; local `IndexedDB` for offline queued ops.
Causality & convergence: operation metadata with a per-client counter + replica id (or a vector clock when the replica count R is small); the server enforces causal delivery where op-based CRDTs require it.
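
As a rough illustration of the causal-delivery point in the skeleton, the sketch below buffers an operation until everything it depends on has been delivered. `Op`, `CausalBuffer`, and their fields are hypothetical names, and the dependency format (a map of replica id to counter) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Op:
    replica: str          # hypothetical replica/client id
    counter: int          # per-replica sequence number, starting at 1
    deps: Dict[str, int]  # what the sender had seen when it created the op
    payload: str = ""


@dataclass
class CausalBuffer:
    delivered: Dict[str, int] = field(default_factory=dict)
    pending: List[Op] = field(default_factory=list)

    def _ready(self, op: Op) -> bool:
        # Next op from its own replica, and all cross-replica deps delivered.
        if op.counter != self.delivered.get(op.replica, 0) + 1:
            return False
        return all(self.delivered.get(r, 0) >= n
                   for r, n in op.deps.items() if r != op.replica)

    def receive(self, op: Op) -> List[Op]:
        """Buffer op, then return every op that is now deliverable, in order."""
        already = op.counter <= self.delivered.get(op.replica, 0)
        queued = any(p.replica == op.replica and p.counter == op.counter
                     for p in self.pending)
        if already or queued:
            return []  # duplicate delivery: ignore
        self.pending.append(op)
        out: List[Op] = []
        progress = True
        while progress:
            progress = False
            for cand in list(self.pending):
                if self._ready(cand):
                    self.pending.remove(cand)
                    self.delivered[cand.replica] = cand.counter
                    out.append(cand)
                    progress = True
        return out
```

If a reply from replica B arrives before the edit from A that it depends on, `receive` returns nothing for the reply and then releases both, in causal order, once A's edit arrives.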

Key tradeoff to call out: choosing a CRDT avoids a central transform service and simplifies offline merging but increases metadata and complicates compaction; OT can be leaner but needs a central sequencer and sticky routing. Close: mention that you'd prototype with a small CRDT library, measure per-op metadata and per-user bandwidth, and, if time allows, implement tombstone GC and stress-test reconnect scenarios.

A second angle

If the question restricts scope to annotation-only collaboration (no fine-grained text edit), the same concepts simplify: model annotations as a CRDT map with commutative add/remove operations and versioned attributes. This reduces sequence complexity and vector-clock pressure because operations are coarser-grained and can be batched. The interviewer may pivot to "how to display presence and edit intent" — answer by keeping ephemeral per-session metadata in an in-memory pub/sub layer, separate from durable CRDT state. A different constraint is strict auditability (immutable history): prefer append-only logs and server-authoritative timestamps, and perform merges deterministically to preserve provenance.
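
One way to picture the annotation-only model is an observed-remove set of annotation ids, sketched below under simplifying assumptions: state-based merging, add-wins semantics, and no attribute handling. `AnnotationSet` and its fields are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

Tag = Tuple[str, int]  # (replica id, per-replica counter): unique per add


@dataclass
class AnnotationSet:
    replica: str
    counter: int = 0
    adds: Dict[str, Set[Tag]] = field(default_factory=dict)  # annotation id -> add tags
    removed: Set[Tag] = field(default_factory=set)           # add tags that were removed

    def add(self, ann_id: str) -> None:
        self.counter += 1
        self.adds.setdefault(ann_id, set()).add((self.replica, self.counter))

    def remove(self, ann_id: str) -> None:
        # Remove only the add tags this replica has actually observed.
        self.removed |= self.adds.get(ann_id, set())

    def merge(self, other: "AnnotationSet") -> None:
        # Set unions are commutative, associative, and idempotent.
        for ann_id, tags in other.adds.items():
            self.adds.setdefault(ann_id, set()).update(tags)
        self.removed |= other.removed

    def visible(self) -> Set[str]:
        # An annotation is visible while at least one add tag is not removed.
        return {a for a, tags in self.adds.items() if tags - self.removed}
```

A concurrent re-add survives a remove because the remove only covers tags the remover had seen. The `removed` set plays the role of tombstones, so it needs the same compaction plan as any other CRDT metadata.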

Common pitfalls

Pitfall: Choosing OT without addressing sticky routing.
Many candidates propose OT but omit how to guarantee the same transform order across servers; without sticky sessions or a single transform head, replicas diverge. Better: specify the routing or sequencer and how it fails over.
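
To show what a single transform head buys you, here is a minimal sketch of a per-document sequencer that transforms each incoming operation against everything applied since the client's base revision. It is insert-only and server-side only; `Insert`, `transform`, and `Sequencer` are hypothetical names, and a real OT system would also need delete handling and client-side transforms.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    client: str  # used only to break ties deterministically


def transform(op: Insert, applied: Insert) -> Insert:
    """Shift op so it keeps its intent after `applied` has already been applied."""
    if applied.pos < op.pos or (applied.pos == op.pos and applied.client < op.client):
        return Insert(op.pos + len(applied.text), op.text, op.client)
    return op


class Sequencer:
    """Single ordering point for one document (insert-only sketch)."""

    def __init__(self, text: str = "") -> None:
        self.text = text
        self.history: List[Insert] = []  # history[i] took revision i to i + 1

    def submit(self, op: Insert, base_rev: int) -> int:
        if not 0 <= base_rev <= len(self.history):
            raise ValueError("unknown base revision")
        # Rebase the op over everything the client had not seen yet.
        for applied in self.history[base_rev:]:
            op = transform(op, applied)
        self.text = self.text[:op.pos] + op.text + self.text[op.pos:]
        self.history.append(op)
        return len(self.history)  # new revision number
```

Because `history` lives in one place, every operation is transformed against the same sequence. Running two such sequencers for one document without coordination is exactly the divergence this pitfall describes.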

Pitfall: Ignoring metadata growth from CRDTs.
A tempting answer praises CRDT convergence but forgets tombstones/vector-clock growth; interviewers expect a GC/compaction plan and bounded storage strategy.
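
A GC plan can be as simple as the sketch below, which drops a tombstone only once every known client has acknowledged the delete. It assumes a single server-assigned sequence number per document rather than full vector clocks, and `Element`, `stable_seq`, and `compact` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Element:
    id: str
    value: str
    deleted_at: Optional[int] = None  # server sequence number of the delete, if any


def stable_seq(acked: Dict[str, int]) -> int:
    """Highest sequence number that every known client has acknowledged."""
    return min(acked.values(), default=0)


def compact(elements: List[Element], acked: Dict[str, int]) -> List[Element]:
    """Drop tombstones whose delete every known client has already seen."""
    cutoff = stable_seq(acked)
    return [e for e in elements
            if e.deleted_at is None or e.deleted_at > cutoff]
```

Note that one long-offline client pins the cutoff and blocks all purging, so the plan also needs a policy for expiring stale clients and forcing them to resync from a snapshot.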

Pitfall: Treating offline sync as "replay ops blindly".
Naively replaying client ops can break causality and permissions. Always validate ops server-side, verify causal dependencies, and offer three-way merge/resolution flows for incompatible changes.
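
The validation steps can be sketched as a server-side ingest function that checks idempotency, authorization, and the client's base version before applying a queued operation. This is an illustration under simplifying assumptions (one sequence number per document, an in-memory log); `QueuedOp`, `DocServer`, and the result strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class QueuedOp:
    key: str       # idempotency key generated on the client
    user: str
    base_seq: int  # last server sequence number the client had seen
    payload: str


@dataclass
class DocServer:
    editors: Set[str]
    snapshot_seq: int = 0  # ops at or below this were folded into a snapshot
    head_seq: int = 0      # latest applied sequence number
    applied_keys: Set[str] = field(default_factory=set)
    log: List[QueuedOp] = field(default_factory=list)

    def ingest(self, op: QueuedOp) -> str:
        if op.key in self.applied_keys:
            return "duplicate"      # retry of an op that already landed
        if op.user not in self.editors:
            return "forbidden"      # ACL is checked at replay time, not queue time
        if op.base_seq > self.head_seq:
            return "unknown-base"   # depends on state the server never had
        if op.base_seq < self.snapshot_seq:
            return "needs-merge"    # base was compacted: hand off to three-way merge
        self.log.append(op)
        self.head_seq += 1
        self.applied_keys.add(op.key)
        return "applied"
```

Only applied keys are remembered here, so an operation rejected for permissions can succeed on a later retry if access is granted in the meantime.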

Connections

Interviewers often pivot to related topics: real-time presence and how it scales (heartbeats, ephemeral state, presence partitioning) and data storage choices (`Cassandra` vs `Postgres` for storing op logs and snapshots). They may also ask about monitoring (p99 latency, dropped-update rates) and end-to-end testing (chaos tests for reconnects).