Adobe Real-Time Collaboration Messaging

What's being tested

Interviewers are probing your ability to design and reason about a real-time collaboration messaging system that meets interactive latency, consistency, and scale requirements while handling concurrent edits and offline clients. Expect to demonstrate tradeoffs between consistency models, operation formats, transport choices, and operational concerns (scaling, monitoring, GC), plus concrete engineering decisions you'd implement and justify. Adobe cares because collaboration features must be low-latency, robust across flaky networks, and scale across many documents and large binary assets.

Core knowledge

WebSocket vs WebRTC: WebSocket is server-mediated, which keeps NAT traversal simple and suits central routing; WebRTC data channels enable P2P/mesh topologies with lower latency but complicate NAT traversal (STUN/TURN), security, and scaling to many peers.
Operation-based vs state-based replication: operation-based (op-logs) sends deltas; state-based (snapshots) sends full state or CRDT merges. Ops are smaller but need causal delivery and idempotency.
CRDT and OT fundamentals: CRDTs rely on commutative operations (or mergeable state), so replicas converge without central ordering or server arbitration; OT preserves intention by transforming concurrent operations and often requires centralized ordering or a transformation engine.
Sequence CRDTs: examples like RGA, Logoot, LSEQ handle insert/delete on ordered sequences; they trade metadata size vs tombstone growth and identifier length.
Causality & ordering: vector clocks, Lamport timestamps, or hybrid logical clocks provide ordering metadata; choose vector clocks when you must detect concurrency per client, and Lamport timestamps for compactness at scale (they give an order consistent with causality but cannot tell concurrent ops apart).
Idempotency & delivery semantics: design for at-least-once delivery with idempotent ops (unique op IDs), or add effectively-exactly-once semantics via dedup tables; at scale, dedup tables must be bounded and garbage-collected.
Persistence & replication: store durable op-log in Kafka/append-only store and materialize state in Postgres/document DB; use changefeeds for replicas and read-models. Partition by documentId for affinity.
Latency & SLOs: interactive features target sub-200ms end-to-end; cursor/awareness updates aim for <50–100ms. When sizing for load, p99 matters more than the mean.
Offline sync: clients buffer ops in a local log, assign monotonic client sequence numbers, and sync with server. Resolve via CRDT merge or OT transformation; ensure causal dependency tracking to replay safely.
Tombstones & GC: deletion often creates tombstones that accumulate; plan a safe GC protocol (vector-clock based, stable-ancestor watermarks, compaction snapshots).
Large assets & delta sync: for binary blobs, use chunking and content-addressed storage (CAS), CDN for delivery, and server-side reference metadata for collaborative edits rather than sending full binaries through real-time channel.
Security & access control: short-lived auth tokens for signaling, per-document ACLs enforced at server/gateway; end-to-end encryption is possible but complicates server-side merging and search.
Monitoring & testing: track ops/sec, activeDocs, convergenceTime, p99 latency, tombstone growth; use chaos/fault-injection tests and deterministic replay of op-logs for correctness checks.
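
The causality and idempotency items above can be sketched together. This is a minimal illustration, not a production design: the names Op and CausalReceiver are hypothetical, the vector clock is a plain dict of per-client counters, and the dedup set is unbounded (a real system would need to bound or GC it, as noted above). An op is held back until every op it causally depends on has been delivered, and a redelivered op is ignored.

```python
from dataclasses import dataclass


@dataclass
class Op:
    op_id: str    # globally unique ID, e.g. "A:1" (assumed format)
    client: str   # ID of the client that produced the op
    clock: dict   # sender's vector clock, already counting this op
    payload: str


class CausalReceiver:
    """Buffers ops until their causal dependencies are met; drops duplicates."""

    def __init__(self):
        self.delivered = {}   # client -> number of ops delivered from that client
        self.seen = set()     # op IDs already applied (unbounded in this sketch)
        self.pending = []     # ops waiting on missing dependencies
        self.applied = []     # payloads in delivery order

    def _ready(self, op):
        for client, count in op.clock.items():
            have = self.delivered.get(client, 0)
            if client == op.client:
                if count != have + 1:   # must be the sender's next op
                    return False
            elif count > have:          # depends on an op we have not delivered
                return False
        return True

    def receive(self, op):
        if op.op_id in self.seen or any(p.op_id == op.op_id for p in self.pending):
            return  # at-least-once delivery: a duplicate is a no-op
        self.pending.append(op)
        progressed = True
        while progressed:  # delivering one op may unblock others
            progressed = False
            for p in list(self.pending):
                if self._ready(p):
                    self.pending.remove(p)
                    self.seen.add(p.op_id)
                    self.delivered[p.client] = p.clock[p.client]
                    self.applied.append(p.payload)
                    progressed = True
```

For example, if B's first op was created after B saw A's first op, delivering B's op first leaves it pending until A's op arrives; delivering A's op twice applies it once.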

Worked example — Design a collaborative text editor (multi-user, offline support)

First 30s: clarify SLOs and constraints (expected concurrent editors per document, offline tolerance, max doc size, whether the server must interpret ops), and ask whether strong intention preservation is required. Skeleton answer pillars: (1) Transport & connectivity (WebSocket for client-server; fall back to HTTP long-polling), (2) Operation model (choose a CRDT like RGA/LSEQ for offline-first convergence, or OT if strict intention preservation is mandatory), (3) Persistence & replication (append-only op-log in Kafka plus materialized state in Postgres, sharded by documentId), (4) Causality and idempotency (client-generated unique op IDs; vector clocks or Lamport timestamps for ordering), (5) Operational concerns (tombstone GC, metrics, backpressure). One explicit tradeoff: pick a CRDT to avoid central transformation complexity and allow client-side merges, at the cost of larger per-character metadata and more complex GC; this is acceptable if offline support and real-time UX are prioritized. To close: mention incremental improvements you'd implement given time, such as delta compression for ops, partial replication for large docs, tests for edge cases (simulated partitions), and policies for tombstone compaction.
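
To make the CRDT option in pillar (2) concrete, here is a minimal RGA-style sequence sketch. It is illustrative only: the class name RGA, the tuple op format, and the list-based storage are assumptions, and it presumes causal delivery (an insert or delete never arrives before the element it references). Each element carries a (counter, site) ID; concurrent inserts after the same element are ordered by ID, and deletes leave tombstones.

```python
ROOT = (0, "")  # sentinel ID marking the start of the document


class RGA:
    def __init__(self, site):
        self.site = site
        self.counter = 0                    # Lamport-style counter for new IDs
        self.nodes = [[ROOT, None, False]]  # each node: [id, char, deleted]

    def _index(self, node_id):
        for i, node in enumerate(self.nodes):
            if node[0] == node_id:
                return i
        raise KeyError(node_id)  # reference not yet delivered

    def local_insert(self, after_id, ch):
        self.counter += 1
        op = ("ins", (self.counter, self.site), after_id, ch)
        self.apply(op)
        return op  # broadcast this op to the other replicas

    def local_delete(self, target_id):
        op = ("del", target_id)
        self.apply(op)
        return op

    def apply(self, op):
        if op[0] == "ins":
            _, new_id, after_id, ch = op
            if any(n[0] == new_id for n in self.nodes):
                return  # duplicate delivery: already integrated
            self.counter = max(self.counter, new_id[0])
            i = self._index(after_id) + 1
            # Skip elements with larger IDs so every replica picks the same slot.
            while i < len(self.nodes) and self.nodes[i][0] > new_id:
                i += 1
            self.nodes.insert(i, [new_id, ch, False])
        else:
            # Tombstone instead of removing: later ops may still reference it.
            self.nodes[self._index(op[1])][2] = True

    def text(self):
        return "".join(n[1] for n in self.nodes if n[1] is not None and not n[2])


a, b = RGA("A"), RGA("B")
op_h = a.local_insert(ROOT, "H")
b.apply(op_h)
op_i = a.local_insert(op_h[1], "i")  # concurrent with the insert below
op_x = b.local_insert(op_h[1], "!")
a.apply(op_x)
b.apply(op_i)
print(a.text(), b.text())  # both replicas print the same string
```

The per-character IDs and tombstones visible here are exactly the metadata cost named in the tradeoff above; a linear scan per op is also far too slow for real documents and would be replaced by an indexed structure.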

A second angle — Real-time collaboration for creative apps with large binary layers

How the framing shifts: concurrency is often coarse-grained (layers/objects), not per-character; latency for brush strokes matters, but most binary edits can be chunked and shipped asynchronously. Use a hybrid model: a real-time WebSocket channel for small metadata/commands (cursor, transform, layer metadata), and a background pipeline for large binary chunks with CAS + CDN. For collaboration semantics, prefer object-level CRDTs or operational logs of immutable edits, combined with optimistic locking for high-cost operations (merge requests, layer rebase). Emphasize asset storage separation: do not stream entire blobs through the real-time broker; instead exchange references and patch deltas. Tradeoffs: object-level locking simplifies merges but reduces concurrent fine-grained edits; CRDT metadata growth must be managed with snapshotting and compaction.
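
The "exchange references, not blobs" idea can be sketched with content-addressed chunks. This is a simplified illustration under stated assumptions: fixed-size chunking with a tiny CHUNK_SIZE for readability (real systems typically use much larger or content-defined chunks), an in-memory dict standing in for the CAS store, and hypothetical function names. Only the manifest (a list of hashes) would travel over the real-time channel; missing chunks go through the background pipeline.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, for demonstration only


def chunk_manifest(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into chunks; return (ordered hash list, hash -> bytes map)."""
    manifest, chunks = [], {}
    for offset in range(0, len(data), chunk_size):
        piece = data[offset:offset + chunk_size]
        digest = hashlib.sha256(piece).hexdigest()
        manifest.append(digest)
        chunks[digest] = piece
    return manifest, chunks


def missing_chunks(manifest, store):
    """Hashes the store lacks, deduplicated, in first-seen order."""
    return [d for d in dict.fromkeys(manifest) if d not in store]


def reassemble(manifest, store) -> bytes:
    return b"".join(store[d] for d in manifest)


store = {}  # stand-in for the content-addressed store
m1, c1 = chunk_manifest(b"AAAABBBBCCCC")
store.update(c1)

m2, c2 = chunk_manifest(b"AAAAXXXXCCCC")  # one region of the layer edited
to_upload = missing_chunks(m2, store)     # only the changed chunk
store.update({d: c2[d] for d in to_upload})
```

With fixed-size chunks, an insertion that shifts bytes invalidates every following chunk, which is one reason content-defined chunking is often preferred for assets that change in the middle.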

Common pitfalls

Pitfall: Designing for single-writer assumptions.
Many teams assume one active editor per document and build a strong central sequencer; this breaks down once offline edits or concurrent mobile editors appear. It is better to design for out-of-order arrival and to choose CRDTs or server arbitration explicitly.

Pitfall: Ignoring tombstone and metadata growth.
A correct CRDT/OT design that never compacts leads to unbounded storage and degraded performance; interviewers expect a GC/compaction strategy (epoch checkpoints, stable ancestor computation).
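
One way to describe such a strategy is a stable watermark. The sketch below assumes, hypothetically, that the server assigns a total-order sequence number to every op and that each replica reports the highest contiguous sequence it has applied; a tombstone is purged only when every known replica has applied the delete that created it. The field name deleted_at and both function names are illustrative.

```python
def stable_watermark(acked: dict) -> int:
    """Highest server sequence number that every known replica has applied."""
    return min(acked.values()) if acked else 0


def compact(elements: list, acked: dict) -> list:
    """Drop tombstones whose delete op is at or below the stable watermark."""
    watermark = stable_watermark(acked)
    return [
        e for e in elements
        if e["deleted_at"] is None or e["deleted_at"] > watermark
    ]


elements = [
    {"id": "e1", "deleted_at": None},  # live element, always kept
    {"id": "e2", "deleted_at": 3},     # every replica has applied delete #3
    {"id": "e3", "deleted_at": 7},     # a lagging replica has not applied #7
]
acked = {"client-a": 5, "client-b": 9, "client-offline": 4}
kept = compact(elements, acked)  # e2 is purged; e1 and e3 remain
```

Note that a long-offline replica pins the watermark and blocks all compaction, so a real policy also needs a cutoff after which stale replicas are dropped from the set and must resync from a snapshot.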

Pitfall: Skipping clarifying questions about SLOs and constraints.
Candidates who jump into CRDT vs OT without asking about offline requirements, expected concurrency, or binary asset sizes miss essential tradeoffs; explicitly state assumptions before choosing an architecture.

Connections

This topic naturally pivots to distributed consensus (when a sequencer or strong leader is chosen), edge caching/CDN for large-asset delivery, and observability/chaos engineering practices to validate real-time guarantees under network partitions.