What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Technical Screen rounds at Wells Fargo.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Wells Fargo during technical interviews.

Design a Resilient Document Aggregation Service | Wells Fargo Interview Question

This system design question tests a candidate's ability to architect a fault-tolerant, high-throughput data aggregation service involving concurrent downstream API calls, resilience patterns, and read/write separation. It evaluates practical mastery of distributed systems concepts including circuit breakers, dead letter queues, CQRS, and eventual consistency at scale.

Design a Resilient Document Aggregation Service

Your team owns a backend service that assembles a "document bundle" for clients. To build one bundle, the service must call three independent, already-existing downstream service APIs, each of which returns a partial document (for example: a profile section, a transactions section, and a statements section). The service must aggregate the three partial documents into a single combined record, persist it to a database, and serve that persisted, aggregated record to clients on demand.

Design an end-to-end solution for this service. The downstream APIs are operated by other teams, have their own latency and availability characteristics, and can fail or be slow at any time. Your design must continue to function gracefully when one or more downstream calls fail or time out.

Constraints & Assumptions

Three downstream document APIs, each owned by a different team. Assume each has p99 latency in the low hundreds of milliseconds and an availability of roughly 99.5%, with occasional multi-minute outages.
Aggregation requests arrive at a peak of a few hundred per second; client reads of already-aggregated bundles are roughly 10x the write rate.
A persisted aggregated bundle must be available to clients with read latency in the tens of milliseconds at p99.
Bundles can tolerate eventual consistency: a client may briefly read a slightly stale bundle, but the system must converge.
A bundle should not be silently dropped on partial failure; every aggregation attempt must reach a terminal, observable outcome (succeeded, retried, or parked for investigation).
Assume the three downstream calls for a given bundle are independent of each other (no ordering dependency between them).
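As a rough back-of-the-envelope check on these numbers, the sketch below estimates how often all three calls succeed on the first attempt. It assumes failures are independent across the three downstreams and across attempts, which the problem statement does not guarantee (multi-minute outages make failures correlated in time), and the function names are illustrative only.

```python
# Back-of-the-envelope availability math for a three-way fan-out.
# Assumption (not from the problem statement): failures are independent
# across downstreams and across attempts.

PER_API_AVAILABILITY = 0.995  # "roughly 99.5%" from the constraints
NUM_APIS = 3


def all_succeed_probability(p: float, n: int) -> float:
    """Probability that n independent calls, each succeeding with p, all succeed."""
    return p ** n


def success_with_attempts(p: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p) ** attempts


first_try = all_succeed_probability(PER_API_AVAILABILITY, NUM_APIS)
with_one_retry = all_succeed_probability(
    success_with_attempts(PER_API_AVAILABILITY, 2), NUM_APIS
)

if __name__ == "__main__":
    print(f"all three succeed on first try: {first_try:.4f}")
    print(f"all three succeed with one retry each: {with_one_retry:.6f}")
```

Under these assumptions about 98.5% of bundles complete on the first attempt, so about 1.5% need a retry or a partial-bundle path. At an assumed peak of 300 requests per second, that is roughly 4 to 5 bundles per second that the failure path has to absorb, and far more during an outage.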

Clarifying Questions to Ask

Is a bundle valid only when all three sections are present, or can it be persisted as "partial" with missing sections filled in later? This drives whether failures block the whole bundle or only one section.
What is the freshness requirement — must a bundle reflect the latest downstream data on every client read, or is periodic/triggered refresh acceptable?
Are aggregation requests triggered synchronously by a client request, or asynchronously by an event/schedule? This decides whether the client waits for the downstream fan-out.
What is the idempotency key for a bundle (e.g. a customer/document id), and can the same aggregation be safely re-run without producing duplicates?
What are the retry/SLA expectations from the downstream teams — are their endpoints idempotent and safe to retry?
What is the data-retention and PII/compliance posture for the persisted bundles (relevant for a financial-services context)?

Part 1

Design the write path: how an aggregation request flows from arrival through the three downstream calls to a persisted bundle. Explain how you fan out the three calls, how you combine the results, and where the orchestration logic lives.
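One possible shape for the fan-out and combine step is sketched below with Python's asyncio. It is a minimal illustration, not a reference answer: `fetch_section` is a hypothetical stand-in for the real downstream HTTP calls, the section names come from the example in the problem statement, and the "partial" status assumes the interviewer allows partial bundles.

```python
import asyncio
from typing import Awaitable, Callable

# Section names taken from the example in the problem statement.
SECTIONS = ("profile", "transactions", "statements")

FetchFn = Callable[[str, str], Awaitable[dict]]


async def fetch_section(name: str, bundle_id: str) -> dict:
    """Hypothetical stand-in for a downstream API call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"section": name, "bundle_id": bundle_id}


async def aggregate(
    bundle_id: str, fetch: FetchFn = fetch_section, timeout_s: float = 0.5
) -> dict:
    """Call the three downstreams concurrently and combine the results."""

    async def call(name: str) -> dict:
        # Per-call timeout so one slow downstream cannot stall the bundle.
        return await asyncio.wait_for(fetch(name, bundle_id), timeout=timeout_s)

    # return_exceptions=True keeps one failure from cancelling the others.
    results = await asyncio.gather(
        *(call(name) for name in SECTIONS), return_exceptions=True
    )

    sections, missing = {}, []
    for name, result in zip(SECTIONS, results):
        if isinstance(result, BaseException):
            missing.append(name)
        else:
            sections[name] = result

    return {
        "bundle_id": bundle_id,
        "status": "complete" if not missing else "partial",
        "sections": sections,
        "missing": missing,
    }


if __name__ == "__main__":
    print(asyncio.run(aggregate("bundle-123")))
```

Because the three calls are independent, running them concurrently keeps the bundle latency close to the slowest single call rather than the sum of all three.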

Part 2

Make the write path resilient to downstream failure. Specify exactly what happens when one (or more) of the three calls times out, returns an error, or the downstream is fully down. Cover transient vs. persistent failure and how a request reaches a terminal, observable outcome.
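The sketch below shows one way the transient and persistent failure paths could fit together: bounded retries with jittered exponential backoff for transient errors, a simple circuit breaker that fails fast while a downstream is down, and a dead letter queue as the terminal "parked" outcome. All class names, function names, thresholds, and the in-memory list used as a DLQ are assumptions for illustration; a real design would likely use a resilience library and a durable queue.

```python
import random
import time
from typing import Callable, List, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


def process(message: dict, handler: Callable[[dict], object],
            breaker: CircuitBreaker, dlq: List[dict], max_attempts: int = 3,
            base_delay_s: float = 0.1,
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Drive one message to a terminal outcome: 'succeeded' or 'parked'."""
    for attempt in range(1, max_attempts + 1):
        try:
            breaker.call(lambda: handler(message))
            return "succeeded"
        except CircuitOpenError:
            break  # do not spend retries on a dependency known to be down
        except Exception:
            if attempt < max_attempts:
                # Exponential backoff with full jitter.
                sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    dlq.append(message)  # parked for later replay or investigation
    return "parked"
```

Every call to `process` ends in exactly one of two states, which is what makes the outcome observable: either the handler succeeded, or the message is sitting in the DLQ where it can be counted, alerted on, and replayed.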

Part 3

Design the read path and scaling for serving already-aggregated bundles to clients, given reads are ~10x writes and must return in the tens of milliseconds at p99.
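A common way to hit a tens-of-milliseconds p99 on a read-heavy workload is a cache-aside layer in front of the persisted bundles. The sketch below is one minimal illustration of that idea, assuming a short TTL is acceptable because bundles tolerate eventual consistency. `BundleReader` and the dict used as the backing store are hypothetical; a real design would more likely use a shared cache and a read replica.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class BundleReader:
    """Cache-aside reader: serve from cache, fall back to the store on a miss."""

    def __init__(self, store: Dict[str, dict], ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.store = store      # stand-in for the database or a read replica
        self.ttl_s = ttl_s      # bound on staleness of a cached bundle
        self.clock = clock
        self._cache: Dict[str, Tuple[float, dict]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, bundle_id: str) -> Optional[dict]:
        now = self.clock()
        entry = self._cache.get(bundle_id)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        bundle = self.store.get(bundle_id)
        if bundle is not None:
            self._cache[bundle_id] = (now + self.ttl_s, bundle)
        return bundle

    def invalidate(self, bundle_id: str) -> None:
        """Called by the write path after persisting a new bundle version."""
        self._cache.pop(bundle_id, None)
```

The TTL bounds how stale a read can be if an invalidation is lost, while explicit invalidation from the write path lets most updates become visible sooner. Both together give the "briefly stale but converges" behaviour the constraints allow.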

Follow-up Questions

How would you guarantee exactly-once-effect persistence of a bundle when retries and at-least-once queue delivery can both replay a message?
A downstream team ships a breaking change to one API's response schema. How does your design detect and contain the blast radius, and how do you version the contract?
One downstream is consistently slow but not failing (high latency, no errors). Your circuit breaker stays closed. How do you protect overall bundle latency and the thread/connection pool?
How would you run a DLQ replay safely after a multi-hour downstream outage without overwhelming the now-recovered dependency?
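For the first follow-up, one commonly used approach is to make the write itself idempotent, so that a replayed message has no additional effect. The sketch below illustrates this with a versioned upsert keyed on the bundle id, using Python's built-in sqlite3 module. The table layout, the `version` column, and the function names are assumptions for illustration, and the `ON CONFLICT ... DO UPDATE` syntax requires SQLite 3.24 or newer.

```python
import json
import sqlite3
from typing import Optional, Tuple


def open_store() -> sqlite3.Connection:
    """In-memory stand-in for the bundle database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bundles ("
        " bundle_id TEXT PRIMARY KEY,"
        " version INTEGER NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def persist_bundle(conn: sqlite3.Connection, bundle_id: str,
                   version: int, body: dict) -> None:
    """Upsert that only applies when the incoming version is strictly newer.

    Replaying the same message, or delivering an older one late, leaves the
    stored row unchanged, so at-least-once delivery is safe.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO bundles (bundle_id, version, body) VALUES (?, ?, ?)
            ON CONFLICT(bundle_id) DO UPDATE
               SET version = excluded.version, body = excluded.body
             WHERE excluded.version > bundles.version
            """,
            (bundle_id, version, json.dumps(body)),
        )


def read_bundle(conn: sqlite3.Connection,
                bundle_id: str) -> Optional[Tuple[int, dict]]:
    row = conn.execute(
        "SELECT version, body FROM bundles WHERE bundle_id = ?", (bundle_id,)
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))
```

The key design choice is that correctness lives in the storage layer rather than in the consumer: the queue can stay at-least-once and retries can stay simple, because the stored row only ever moves forward in version.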
