Adobe Sharded Tenant Data And Transaction Integrity
Asked of: Software Engineer
What's being tested
Candidates must show practical mastery of designing sharded, multi-tenant storage that preserves transactional integrity under realistic failure, scaling, and latency constraints. Interviewers probe tradeoffs between strong atomicity (cross-shard transactions) and availability/latency, the mechanics of coordination (consensus, two-phase commit (2PC), sagas), and operational primitives (shard keys, idempotency, retries, partition rebalancing). Adobe cares because tenant isolation, billing correctness, and UX depend on predictable correctness and performance under shard failures and tenant hotspots.
Core knowledge
- Sharding: split by tenant ID vs. tenant+hash; shard key choice controls locality, hotspot risk, and ability to execute single-shard transactions without coordination.
- Multi-tenant isolation: options are logical (shared tables with tenant_id) vs. physical (per-tenant DB/cluster); physical isolation simplifies transactions but costs more operationally.
- Transaction models: know local ACID (single-shard) vs. distributed transactions (cross-shard). Local transactions are cheap; distributed require coordination like two-phase commit (2PC) or consensus-based commits.
- Two-phase commit (2PC): coordinator + participants; costs ≈ 2 round-trips + log writes; participants that have already voted to commit stay blocked if the coordinator fails, until it recovers or a recovery protocol resolves the outcome. Use for strict atomicity when the latency and availability tradeoffs are acceptable.
- Saga pattern: sequence of local transactions with compensating actions for rollback; provides eventual consistency, lower blocking risk, but requires compensating logic and careful invariants.
- Consensus & metadata: use a Raft/Paxos-style consensus store (e.g., etcd, ZooKeeper) for leader election, shard map, and transaction coordinator metadata; consensus gives metadata correctness but adds complexity and latency.
- Isolation levels: serializable ≈ correct but expensive; snapshot isolation prevents many anomalies but can allow write skew. Be ready to defend the chosen isolation level with data model constraints.
- Idempotency & retry: require idempotency keys and durable dedup stores for clients and coordinators to handle retries and at-least-once semantics without double-apply.
- Failure scenarios: plan for coordinator crash, participant crash, network partition, and disk loss. Recovery relies on durable logs, leader election, and explicit timeout + recovery workflows.
- Rebalancing and resharding: moving tenant partitions must preserve per-tenant ordering and in-flight transactions; safe approach: pause writes, migrate state, update shard map via consensus, resume.
- Performance math: distributed commit latency ≈ 2 * RTT + participant fsync; throughput is often bounded by the slowest participant. Rough model: effective throughput ≈ min_i(leader_i_throughput)/k for k-way cross-shard transactions.
- Testing and observability: inject partition, pause processes, simulate disk fsync slowness; surface metrics like p99 commit latency, abort rate, stale reads, and broken invariants.
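The performance-math bullet above can be turned into a quick back-of-envelope calculator. This is a minimal sketch, not a benchmark: the function names are hypothetical, the input numbers are made up for illustration, and taking the slowest participant's fsync as the fsync term is an assumption (it presumes participants flush in parallel).

```python
def commit_latency_ms(rtt_ms, participant_fsync_ms):
    """Rough 2PC commit latency: 2 * RTT plus the slowest participant fsync.

    Assumption: participants fsync in parallel, so only the slowest one
    sits on the critical path.
    """
    return 2 * rtt_ms + max(participant_fsync_ms)


def cross_shard_throughput(leader_tps):
    """Rough model from the text: min_i(leader_i_throughput) / k,
    where k is the number of shards each transaction touches."""
    k = len(leader_tps)
    return min(leader_tps) / k


# Illustrative (made-up) numbers for a 3-way cross-shard transaction.
latency = commit_latency_ms(rtt_ms=1.5, participant_fsync_ms=[2.0, 8.0, 3.0])
tps = cross_shard_throughput([12000.0, 9000.0, 15000.0])
print(latency, tps)
```

In an interview, plugging in even rough numbers like these makes the cost of a cross-shard commit concrete: one slow disk dominates the latency, and one slow leader caps the throughput.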
Worked example — "Design sharded tenant data with transactional integrity across shards"
First 30s framing: ask whether cross-tenant transactions are allowed (almost always "no"), average tenant size, read/write ratio, SLOs for commit latency, and failure/availability targets.
Skeleton answer: (1) choose a shard key (tenant-id primary) to make most transactions single-shard; (2) implement local ACID on shards (Postgres or similar) for single-shard ops; (3) for necessary cross-shard atomicity, pick between 2PC (strong atomicity) or saga (eventual consistency) and justify by latency/availability tradeoffs; (4) add idempotency keys, coordinator durable logs, and leader election for shard map.
Explicit tradeoff: 2PC gives correctness at cost of higher p99 and potential blocking during coordinator failure; sagas reduce blocking but shift complexity into compensations and invariants.
Close by proposing validation: run fault-injection tests, measure p99 and aborts, and iterate; "if I had more time, I'd prototype typical flows, implement a coordinator recovery state machine, and benchmark cross-shard commit latency under load."
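Steps (3) and (4) of the skeleton answer can be sketched as a toy 2PC coordinator with an idempotency-key dedup table. Everything here is an illustrative assumption: `Shard`, `Coordinator`, and `fail_prepare` are hypothetical names, the shards are in-memory dictionaries, and the Python list standing in for the decision log is not durable. A real coordinator would fsync the decision and run a recovery state machine.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """In-memory stand-in for a shard that supports prepare/commit/abort."""
    name: str
    data: dict = field(default_factory=dict)
    staged: dict = field(default_factory=dict)  # txn_id -> pending writes
    fail_prepare: bool = False  # test hook: simulate a "no" vote

    def prepare(self, txn_id, writes):
        if self.fail_prepare:
            return False
        self.staged[txn_id] = dict(writes)  # a real shard would fsync here
        return True

    def commit(self, txn_id):
        self.data.update(self.staged.pop(txn_id, {}))

    def abort(self, txn_id):
        self.staged.pop(txn_id, None)


class Coordinator:
    def __init__(self):
        self.log = []       # stand-in for a durable decision log
        self.outcomes = {}  # idempotency key -> recorded decision

    def execute(self, idem_key, plan):
        """plan is a list of (shard, writes) pairs."""
        # A retry with the same key returns the recorded outcome
        # instead of applying the writes a second time.
        if idem_key in self.outcomes:
            return self.outcomes[idem_key]
        txn_id = idem_key
        prepared, decision = [], "commit"
        # Phase 1: collect votes; any "no" vote aborts the transaction.
        for shard, writes in plan:
            if shard.prepare(txn_id, writes):
                prepared.append(shard)
            else:
                decision = "abort"
                break
        # Record the decision before telling any participant about it.
        self.log.append((txn_id, decision))
        # Phase 2: deliver the decision to every prepared participant.
        for shard in prepared:
            if decision == "commit":
                shard.commit(txn_id)
            else:
                shard.abort(txn_id)
        self.outcomes[idem_key] = decision
        return decision


a, b = Shard("a"), Shard("b")
coord = Coordinator()
plan = [(a, {"tenant42:balance": 90}), (b, {"tenant42:invoice": "inv-1"})]
first = coord.execute("req-1", plan)
retry = coord.execute("req-1", plan)  # deduplicated, not re-applied
```

The sketch deliberately omits the hard part the tradeoff discussion points at: if the coordinator crashes between logging the decision and finishing phase 2, prepared participants stay blocked until recovery replays the log.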
A second angle — "Support a single large tenant spanning many shards with transactional integrity"
Here the problem flips: tenant isolation exists but one tenant's workload spans shards. Emphasize per-tenant coordination: assign a per-tenant transaction manager/leader (co-located with shard leaders) to reduce cross-tenant interference. Use partitioned two-phase commit limited to shards holding that tenant, and consider optimistic concurrency with conflict detection and retry when write contention is low. If the tenant requires very low-latency atomic ops across shards, consider colocating hot partitions or providing a per-tenant logical leader that serializes writes. The core concepts (shard locality, 2PC vs sagas, idempotency, recovery) remain the same; constraints shift design toward per-tenant orchestration and isolation of quota/throughput.
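The optimistic-concurrency option mentioned above can be illustrated with a versioned compare-and-set plus retry. This is a single-key, in-memory sketch under stated assumptions: `VersionedStore`, `VersionConflict`, and `update_with_retry` are hypothetical names, and a real cross-shard variant would have to validate the versions of every key it read at commit time.

```python
class VersionConflict(Exception):
    """Raised when a row changed between read and write."""


class VersionedStore:
    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # Missing keys read as version 0 with no value.
        return self._rows.get(key, (0, None))

    def compare_and_set(self, key, expected_version, value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(key)
        self._rows[key] = (current_version + 1, value)


def update_with_retry(store, key, fn, max_attempts=3):
    """Apply fn to the current value, retrying on conflict.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        version, value = store.read(key)
        try:
            store.compare_and_set(key, version, fn(value))
            return attempt
        except VersionConflict:
            continue  # someone else wrote first: re-read and try again
    raise VersionConflict(f"gave up on {key} after {max_attempts} attempts")


store = VersionedStore()
attempts = update_with_retry(store, "tenant42:quota", lambda v: (v or 0) + 10)
```

This approach only pays off when contention is low, as the text says: under heavy contention most attempts are wasted work, and a per-tenant leader that serializes writes becomes the better fit.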
Common pitfalls
Pitfall: assuming distributed transactions are "just a network call" — candidates often understate the cost of durable logging, multiple fsyncs, and blocking during coordinator failure. Always quantify latency and availability impact.
Pitfall: skipping clarifying questions about tenant semantics — failing to ask whether cross-tenant transactions exist or what invariants must be globally enforced leads to wrong architecture choices (e.g., over-engineering 2PC when single-shard suffices).
Pitfall: treating sagas as "easier" without designing compensations — compensating actions are non-trivial when downstream side effects (billing, external APIs) exist; failing to model compensation or retried partial failure modes leads to correctness bugs.
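To make the saga pitfall concrete, here is a minimal sketch of a saga runner that undoes completed steps in reverse order when a later step fails. The names (`run_saga`, `reserve`, `release`, `charge`, `refund`) and the simulated billing failure are hypothetical, and the sketch assumes compensations always succeed, which is exactly the assumption that breaks in production.

```python
def run_saga(steps):
    """Run (name, action, compensation) triples in order.

    On the first failing action, run the compensations of the steps that
    already completed, newest first, and report which step failed.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, compensate))
        except Exception:
            for _done_name, undo in reversed(completed):
                undo()  # assumption: compensations do not fail
            return ("compensated", name)
    return ("completed", None)


state = {"reserved": 0, "charged": 0}


def reserve():
    state["reserved"] += 1


def release():
    state["reserved"] -= 1


def charge():
    # Simulated downstream failure, e.g. an external billing API.
    raise RuntimeError("billing API unavailable")


def refund():
    state["charged"] -= 1


result = run_saga([("reserve", reserve, release), ("charge", charge, refund)])
```

A production design also has to persist saga progress, make each compensation idempotent so it can be retried, and decide what happens when a compensation itself fails. None of that is modeled here.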
Connections
Candidates should be ready to pivot to adjacent topics: distributed consensus (Raft/Paxos) for metadata correctness, schema and index design to reduce cross-shard transactions, and operational tooling (leader election, backup/restore, and chaos testing). Interviewers may also ask for storage choices (Postgres vs Cassandra vs Spanner) based on transaction and latency needs.
Further reading
- Spanner: Google’s Globally-Distributed Database — seminal paper on globally-consistent transactions and TrueTime, useful for understanding tradeoffs.
- Designing Data-Intensive Applications — Martin Kleppmann — chapters on partitioning, consensus, and transactions give practical design patterns and failure-mode thinking.