Externally Sort a 500 GB CSV by One Column with 16 GB of RAM
Company: Microsoft
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
You are handed a single very large CSV file (roughly 500 GB) sitting on the local disk of one machine. The machine has only about 16 GB of usable RAM. Produce a new CSV file that contains exactly the same records, but fully sorted by one specified column. Design the algorithm and the system around it.
### Constraints & Assumptions
- Input is one CSV file, about 500 GB, on local disk; the machine has roughly 16 GB of usable RAM.
- There is ample scratch disk (assume at least 2x the input size free).
- The sort key is a single column; it may be numeric or textual (confirm which).
- The file has a header row; fields may be quoted and may contain embedded commas or newlines.
- Output is a single CSV with the same schema, header preserved at the top.
- Baseline target is a single commodity machine; a cluster may be available as an extension.
### Clarifying Questions to Ask
- Is the comparison numeric, lexicographic, or locale/collation-specific? Ascending or descending? Does the sort need to be stable?
- Roughly how many rows, and how wide is an average row? (This sets record count and per-row memory.)
- Can fields contain commas or newlines inside quotes, so we need a real CSV parser, or is it a simple delimited format?
- Is this a one-off sort, or will the data be sorted/queried repeatedly (which would favor a database or an index)?
- Single machine only, or is a distributed engine (Spark, MapReduce) available?
- Must the output be a single file, or may it be split into ordered part files?
### Part 1 — Core external sort
Design the core algorithm that turns the unsorted 500 GB file into a sorted one given only 16 GB of RAM.
```hint Where to start
The data is far larger than RAM, so you cannot sort it in one pass. Sort it in memory-sized pieces, then merge the sorted pieces. This is external merge sort.
```
```hint The merge step
To merge many sorted runs into one output stream, repeatedly emit the smallest current row across all runs. A min-heap (priority queue) keyed on the sort column, holding one row per active run, does this in `O(log R)` per row.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 2 — Sizing, I/O, and parallelism
How do you choose the chunk size and the merge fan-in, and where is the bottleneck? What happens if there are more runs than you can merge at once?
```hint Two competing limits
Chunk size is bounded by RAM (minus parsing/object overhead). Merge fan-in is bounded by open file handles and the read buffer you can afford per run. When the number of runs exceeds the fan-in, you cannot merge in one pass.
```
```hint When one merge pass is not enough
With `R` runs and a fan-in of `f`, you need about `ceil(log_f R)` merge passes, each re-reading and re-writing the whole dataset. Pick `f` to keep it to a single (or very few) passes.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 3 — Alternatives and reliability
When would you instead load the data into a database or a distributed engine rather than hand-rolling a sort? And how do you make a multi-hour job restartable if it dies partway through?
```hint Cost model
A one-shot external sort touches the data roughly twice (write runs, then merge). A database pays an up-front indexing cost but amortizes it across many later queries. Match the tool to how often you will read the sorted data.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- If the file grows to 50 TB and you have a 100-node cluster, how does the design change, and how do you avoid skew when one key value dominates?
- How would you efficiently and repeatedly serve pagination requests such as "rows 1,000,000 to 1,000,050 in sorted order"?
- How do you make the sort stable, and why might stability matter here?
- Suppose you must also deduplicate rows that share the same key. How do you fold that into the merge without a second pass?
Quick Answer: This question evaluates a candidate's ability to design an external merge sort for data far larger than available memory, a core systems design skill. It tests reasoning about chunking, k-way merging, I/O bottlenecks, and when a database or distributed engine is preferable to a hand-rolled solution. This is a practical systems design scenario commonly used to assess algorithmic and infrastructure trade-off thinking.