How to stream a large file to 1000 hosts fastest
Company: Anthropic
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
## Problem
You need to distribute one very large file, stored in cloud object storage, to **1000 servers** inside a single data center. Every server must end up with a complete, byte-identical copy of the file. Design the **fastest** delivery scheme — the metric is *makespan*, the time until the **last** of the 1000 servers has the full file.
The two relevant capacity limits are:
- **WAN ingress (cloud storage → data center):** **1 Gb/s** total. This is the aggregate pipe into the DC; pulling from cloud storage on $N$ servers in parallel still shares this same 1 Gb/s — it does *not* scale with the number of downloaders.
- **Per-server NIC:** **1 Gb/s** each, full-duplex (a host can receive at up to 1 Gb/s and send at up to 1 Gb/s simultaneously).
Assume the data-center fabric has high bisection bandwidth, so intra-DC links are not the first-order bottleneck — but call out where top-of-rack (ToR) oversubscription would bite.
```hint Where to start — find the floor
Before designing anything, ask what *no* scheme can beat. Which single resource must every byte pass through, and how does its capacity bound the best-possible makespan? Pin down that floor (in terms of file size $S$) first — the rest of the problem is "how close can we get to it?"
```
```hint A trap worth pricing out
Sketch the most obvious approach — every server fetching from cloud storage on its own — and reason carefully about what those parallel pulls actually share. How does its makespan compare to the floor you just found? Seeing how badly this scales should point you toward where replication ought to happen.
```
```hint Making the rest of the work cheap
Once the data is past the scarce resource, intra-DC bandwidth is plentiful and NICs are full-duplex. Think about how the set of servers that already hold the data can grow each round, and what that implies for how many rounds it takes to cover 1000 hosts. Then ask: does the whole file have to arrive before replication can begin, or can you break it up so the two stages overlap?
```
```hint Edge cases / pitfalls to weigh
What guarantees each host got the *exact* file? Where might a bottleneck reappear that isn't the WAN — the ingress node's own I/O, cross-rack links, or sheer connection count at $N=1000$? And what happens if whatever first pulls the data over the WAN dies partway through?
```
### Constraints & Assumptions
- File size $S$ is "very large" (think tens to hundreds of GB) — large enough that transfer time dominates any setup/handshake/control overhead.
- WAN ingress is a hard aggregate cap of **1 Gb/s**; it does not scale with the number of downloaders.
- NICs are **1 Gb/s full-duplex**, so a host can upload and download concurrently at line rate.
- The internal DC fabric has high bisection bandwidth; treat intra-rack links as plentiful and call out cross-rack/ToR oversubscription explicitly where it matters.
- Object storage supports ranged/parallel reads (chunked GETs).
- You may run an agent on every host and stand up a small control/metadata service.
- All 1000 servers are reachable and cooperative in the base case; the follow-up relaxes this.
- Correctness requires every host to obtain the **exact, verified** full file.
### Clarifying Questions to Ask
- How large is the file, and how often does this run (one-off push vs. recurring fleet-wide deploy)? This affects whether a persistent peer agent / caching layer is worth it.
- Is the 1 Gb/s WAN cap a hard physical/contractual limit, or could we provision more ingress or place a CDN/edge cache closer to the DC?
- What is the topology — single rack or many racks, and what is the oversubscription ratio on ToR uplinks and the spine?
- Are NICs truly full-duplex 1 Gb/s, and can servers talk peer-to-peer freely, or does network policy restrict east-west traffic?
- Is "all 1000 complete" a strict requirement, or is "99% within time $T$, stragglers best-effort" acceptable?
- How much spare disk/memory does each host have to buffer and re-serve chunks?
### What a Strong Answer Covers
- Whether the candidate identifies the binding constraint and derives a **lower bound** on makespan from it, rather than jumping straight to a mechanism.
- Whether they recognize why having every host pull independently is catastrophic, and reason about *what* those parallel pulls actually contend for.
- Whether they separate the problem into a WAN-ingress stage and an in-DC replication stage, and argue why replication inside the DC is cheap.
- Whether they find a way to make in-DC replication scale sub-linearly in the host count, and whether they overlap the two stages instead of running them back-to-back.
- The depth of their **dissemination design** — do they compare structural options, name concrete tradeoffs, and justify a choice (including locality / cross-rack handling)?
- Whether they address **integrity** end-to-end and describe a coherent coordination/metadata story.
- Whether they anticipate bottlenecks *other than* the WAN (ingress-node I/O, ToR oversubscription, connection limits, backpressure).
- The quality of their **failure handling** — seed loss, stragglers/tail latency, and the bad-network follow-up.
- Whether they can produce a back-of-envelope makespan estimate that ties the design back to the lower bound.
### Follow-up Questions
- How does the design change if **some hosts have a poor or unstable network** (slow, lossy, or frequently disconnecting)? How do you stop them from slowing everyone else while still ensuring they eventually complete?
- The last 1–2% of hosts dominate the tail. What concrete techniques cut **straggler / tail latency**?
- When would you reach for **erasure coding** (e.g., Reed–Solomon) instead of exact-chunk retries, and what does it cost you?
- The single ingress seed is a single point of failure and a potential I/O bottleneck. How do you make the ingress path resilient *without* exceeding the 1 Gb/s WAN cap, and how do you verify end-to-end integrity?
Quick Answer: This question evaluates system design and distributed-systems skills, especially bandwidth and bottleneck analysis, replication and pipelining trade-offs, and makespan-based performance reasoning.