Design a Resilient Document Aggregation Service
Company: Wells Fargo
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
## Design a Resilient Document Aggregation Service
Your team owns a backend service that assembles a "document bundle" for clients. To build one bundle, the service must call **three independent, already-existing downstream service APIs**, each of which returns a partial document (for example: a profile section, a transactions section, and a statements section). The service must **aggregate** the three partial documents into a single combined record, **persist** it to a database, and serve that persisted, aggregated record to clients on demand.
Design an **end-to-end solution** for this service. The downstream APIs are operated by other teams, have their own latency and availability characteristics, and can fail or be slow at any time. Your design must continue to function gracefully when one or more downstream calls fail or time out.
### Constraints & Assumptions
- Three downstream document APIs, each owned by a different team. Assume each has p99 latency in the low hundreds of milliseconds and an availability of roughly 99.5%, with occasional multi-minute outages.
- Aggregation requests arrive at a peak of a few hundred per second; client reads of already-aggregated bundles are roughly 10x the write rate.
- A persisted aggregated bundle must be available to clients with read latency in the tens of milliseconds at p99.
- Bundles can tolerate eventual consistency: a client may briefly read a slightly stale bundle, but the system must converge.
- A bundle must not be silently dropped on partial failure; every aggregation attempt must reach a terminal, observable outcome (succeeded, possibly after retries, or parked for investigation).
- Assume the three downstream calls for a given bundle are independent of each other (no ordering dependency between them).
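
To see why partial failure is the normal case rather than an edge case, the availability figure above can be turned into rough arithmetic. This is a back-of-the-envelope sketch only: it treats the 99.5% availability as an independent per-call success probability and picks 300 requests per second as one reading of "a few hundred", both of which are assumptions.

```python
# Illustrative arithmetic; both constants below are assumptions drawn
# from the stated constraints, not measured values.
PER_CALL_AVAILABILITY = 0.995   # treated as per-call success probability
DOWNSTREAMS = 3
ASSUMED_PEAK_RPS = 300          # one reading of "a few hundred per second"


def all_succeed_probability(per_call: float, n: int) -> float:
    """Probability that n independent calls all succeed on the first attempt."""
    return per_call ** n


p_clean = all_succeed_probability(PER_CALL_AVAILABILITY, DOWNSTREAMS)
failures_per_second = ASSUMED_PEAK_RPS * (1 - p_clean)

print(f"first-attempt success probability: {p_clean:.4f}")
print(f"bundles per second needing recovery at peak: {failures_per_second:.2f}")
```

Under these assumptions roughly 1.5% of bundles hit at least one failed call on the first attempt, which is several bundles per second at peak, so the recovery path has to be a first-class part of the design.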
### Clarifying Questions to Ask
- Is a bundle valid only when **all three** sections are present, or can it be persisted as "partial" with missing sections filled in later? This drives whether failures block the whole bundle or only one section.
- What is the freshness requirement — must a bundle reflect the latest downstream data on every client read, or is periodic/triggered refresh acceptable?
- Are aggregation requests triggered synchronously by a client request, or asynchronously by an event/schedule? This decides whether the client waits for the downstream fan-out.
- What is the idempotency key for a bundle (e.g. a customer or document ID), and can the same aggregation be safely re-run without producing duplicates?
- What are the retry/SLA expectations from the downstream teams — are their endpoints idempotent and safe to retry?
- What is the data-retention and PII/compliance posture for the persisted bundles (relevant for a financial-services context)?
### Part 1
Design the **write path**: how an aggregation request flows from arrival through the three downstream calls to a persisted bundle. Explain how you fan out the three calls, how you combine the results, and where the orchestration logic lives.
```hint Decoupling the trigger from the work
Put an ingestion queue between the request and the heavy fan-out so a downstream slowdown creates backpressure instead of blocking callers. Think about an **orchestrator** that owns the lifecycle of one bundle and issues the three calls **asynchronously / in parallel**, then joins the results.
```
```hint Combining independent calls
Because the three calls are independent, you can issue them concurrently and join (e.g. `CompletableFuture.allOf`, a reactive `zip`, or a scatter-gather step in the orchestrator) rather than calling them serially — the bundle latency becomes ~max(call) instead of sum(call).
```
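
The scatter-gather step can be sketched with Python's `asyncio.gather`, which plays the same role as `CompletableFuture.allOf` or a reactive `zip`. This is a minimal sketch, not a definitive implementation: `fetch_section`, the section names, and the simulated delays are hypothetical stand-ins for the real downstream HTTP calls.

```python
import asyncio
import time


async def fetch_section(name: str, delay_s: float) -> dict:
    """Hypothetical stand-in for one downstream API call."""
    await asyncio.sleep(delay_s)  # simulates network latency
    return {"section": name, "body": f"{name}-data"}


async def build_bundle(bundle_id: str) -> dict:
    # Issue the three independent calls concurrently, then join the results.
    sections = await asyncio.gather(
        fetch_section("profile", 0.05),
        fetch_section("transactions", 0.10),
        fetch_section("statements", 0.15),
    )
    return {
        "bundle_id": bundle_id,
        "sections": {s["section"]: s["body"] for s in sections},
    }


start = time.perf_counter()
bundle = asyncio.run(build_bundle("cust-123"))
elapsed = time.perf_counter() - start

print(sorted(bundle["sections"]))
print(f"elapsed: {elapsed:.2f}s (serial calls would take about 0.30s)")
```

Because the calls overlap, the elapsed time tracks the slowest simulated call rather than the sum of all three. A production version would also pass a per-call timeout and decide how to handle a failed section, which Part 2 addresses.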
### Part 2
Make the write path **resilient to downstream failure**. Specify exactly what happens when one (or more) of the three calls times out, returns an error, or the downstream is fully down. Cover transient vs. persistent failure and how a request reaches a terminal, observable outcome.
```hint Stop hammering a sick dependency
A **circuit breaker** per downstream trips open after a failure threshold so you fail fast instead of piling up calls on a dead service; pair it with bounded **timeouts**, **retries with exponential backoff + jitter**, and a fallback (cached/last-known section or a "partial bundle" marker).
```
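
The combination described in this hint can be sketched as a small per-downstream circuit breaker wrapped by a retry loop with exponential backoff and full jitter. All names and thresholds here (`CircuitBreaker`, `call_with_retries`, three failures, a 30-second cool-down) are illustrative assumptions; the sketch is single-threaded and expects the per-call timeout to be enforced inside the wrapped function, for example by the HTTP client.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast; downstream marked unhealthy")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retries(breaker, fn, max_attempts=3, base_s=0.1, cap_s=2.0, sleep=time.sleep):
    """Retry transient failures with full-jitter backoff; never retry an open breaker."""
    for attempt in range(max_attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))


def always_down():
    raise TimeoutError("statements API timed out")


breaker = CircuitBreaker(failure_threshold=3, reset_after_s=30.0)
try:
    section = call_with_retries(breaker, always_down, sleep=lambda s: None)
except TimeoutError:
    section = {"status": "MISSING"}  # fallback: partial-bundle marker

print(section, "breaker open:", breaker.opened_at is not None)
```

In this sketch the third consecutive failure trips the breaker, so later callers fail fast with `CircuitOpenError` until the cool-down passes. Whether the fallback is a "missing section" marker or a cached last-known section depends on the answer to the partial-bundle clarifying question.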
```hint Where do permanently-failing messages go
After retries are exhausted, a message must not be lost or retried forever — route it to a **dead letter queue (DLQ)** for inspection/replay so every attempt has a terminal state.
```
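
The terminal-state guarantee can be illustrated with an in-memory consumer loop. This is a sketch under stated assumptions: `Message`, `drain`, the outcome labels, and the retry budget of three are hypothetical, and a real broker would handle redelivery delays and DLQ routing through its own configuration.

```python
from collections import deque
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # assumed retry budget per message


@dataclass
class Message:
    bundle_id: str
    attempts: int = 0
    last_error: str = ""


def drain(work_queue, dead_letters, handler, outcomes):
    """Process until the queue is empty; each message ends SUCCEEDED or DEAD_LETTERED."""
    while work_queue:
        msg = work_queue.popleft()
        msg.attempts += 1
        try:
            handler(msg)
            outcomes[msg.bundle_id] = "SUCCEEDED"
        except Exception as exc:
            msg.last_error = repr(exc)
            if msg.attempts >= MAX_ATTEMPTS:
                dead_letters.append(msg)  # parked for inspection and replay
                outcomes[msg.bundle_id] = "DEAD_LETTERED"
            else:
                work_queue.append(msg)  # redelivery; a real broker would delay this
                outcomes[msg.bundle_id] = "RETRYING"


def handler(msg):
    # Hypothetical handler: one message always fails to simulate a poison message.
    if msg.bundle_id == "poison":
        raise ValueError("schema mismatch")


work = deque([Message("ok-1"), Message("poison")])
dlq, outcomes = [], {}
drain(work, dlq, handler, outcomes)
print(outcomes)
print([(m.bundle_id, m.attempts, m.last_error) for m in dlq])
```

Keeping the attempt count and last error on the dead-lettered message is what makes the outcome observable: an operator can see why the bundle was parked before deciding to replay it.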
### Part 3
Design the **read path and scaling** for serving already-aggregated bundles to clients, given reads are ~10x writes and must return in the tens of milliseconds at p99.
```hint Split how you write from how you read
Reads and writes have very different shapes here. Consider **CQRS**: an optimized read model/store separate from the write model, served from **read replicas** and/or a cache, so heavy read traffic never contends with the aggregation write load.
```
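
One way to picture the read side is a TTL cache in front of a read replica, with the write side sending an invalidation when a bundle is persisted. This is a minimal sketch, not a definitive design: `BundleReadModel`, the 30-second TTL, and the plain `dict` standing in for the replica are all assumptions made for illustration.

```python
import time


class BundleReadModel:
    """Read-side view: a TTL cache in front of a (hypothetical) read replica."""

    def __init__(self, replica, ttl_s=30.0, clock=time.monotonic):
        self.replica = replica      # dict standing in for a read replica
        self.ttl_s = ttl_s
        self.clock = clock
        self._cache = {}            # bundle_id -> (expires_at, bundle)
        self.replica_reads = 0      # counter to show how often the replica is hit

    def get(self, bundle_id):
        entry = self._cache.get(bundle_id)
        if entry is not None and entry[0] > self.clock():
            return entry[1]         # cache hit: no replica round trip
        self.replica_reads += 1
        bundle = self.replica.get(bundle_id)
        if bundle is not None:
            self._cache[bundle_id] = (self.clock() + self.ttl_s, bundle)
        return bundle

    def on_bundle_persisted(self, bundle_id):
        """Invoked by the write side (e.g. via an event) to drop a stale entry."""
        self._cache.pop(bundle_id, None)


replica = {"cust-123": {"version": 1}}
reads = BundleReadModel(replica, ttl_s=30.0)
reads.get("cust-123")
reads.get("cust-123")               # served from the cache
print("replica reads:", reads.replica_reads)

replica["cust-123"] = {"version": 2}            # write side persists a new bundle
stale = reads.get("cust-123")                   # may still be the old version
reads.on_bundle_persisted("cust-123")
fresh = reads.get("cust-123")
print(stale, fresh)
```

The brief window in which `stale` is returned is the eventual consistency the constraints allow; the TTL bounds that window even if an invalidation event is lost.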
### Follow-up Questions
- How would you guarantee **exactly-once-effect** persistence of a bundle when retries and at-least-once queue delivery can both replay a message?
- A downstream team ships a breaking change to one API's response schema. How does your design detect and contain the blast radius, and how do you version the contract?
- One downstream is consistently slow but not failing (high latency, no errors). Your circuit breaker stays closed. How do you protect overall bundle latency and the thread/connection pool?
- How would you run a **DLQ replay** safely after a multi-hour downstream outage without overwhelming the now-recovered dependency?
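
For the first follow-up, one common approach is an idempotent, versioned upsert keyed on the bundle's idempotency key, so that a replayed message has no additional effect. The sketch below uses SQLite's `INSERT ... ON CONFLICT DO UPDATE` (available in SQLite 3.24 and later) purely for illustration. The table layout and the idea that the orchestrator assigns a monotonically increasing `version` per bundle are assumptions, not part of the original problem statement.

```python
import json
import sqlite3


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bundles ("
        " bundle_id TEXT PRIMARY KEY,"
        " version INTEGER NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def persist_bundle(conn, bundle_id, version, body):
    """Upsert that is safe to replay: the same or an older version is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO bundles (bundle_id, version, body) VALUES (?, ?, ?)
            ON CONFLICT(bundle_id) DO UPDATE
              SET version = excluded.version, body = excluded.body
              WHERE excluded.version > bundles.version
            """,
            (bundle_id, version, json.dumps(body)),
        )


conn = open_store()
persist_bundle(conn, "cust-123", 1, {"profile": "v1"})
persist_bundle(conn, "cust-123", 1, {"profile": "v1"})   # queue redelivery
persist_bundle(conn, "cust-123", 2, {"profile": "v2"})
persist_bundle(conn, "cust-123", 1, {"profile": "v1"})   # late replay of old work

rows = conn.execute("SELECT bundle_id, version, body FROM bundles").fetchall()
print(rows)
```

Delivery stays at-least-once, but the stored state is the same as if each version had been applied exactly once, and a late replay cannot overwrite newer data.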
Quick Answer: This system design question tests a candidate's ability to architect a fault-tolerant, high-throughput data aggregation service involving concurrent downstream API calls, resilience patterns, and read/write separation. It evaluates practical mastery of distributed systems concepts including circuit breakers, dead letter queues, CQRS, and eventual consistency at scale.