What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Onsite rounds at Microsoft.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Microsoft during technical interviews.

Externally Sort a 500 GB CSV by One Column with 16 GB of RAM

This question evaluates a candidate's ability to design an external merge sort for data far larger than available memory, a core systems design skill. It tests reasoning about chunking, k-way merging, I/O bottlenecks, and when a database or distributed engine is preferable to a hand-rolled solution. This is a practical systems design scenario commonly used to assess algorithmic and infrastructure trade-off thinking.

You are handed a single very large CSV file (roughly 500 GB) sitting on the local disk of one machine. The machine has only about 16 GB of usable RAM. Produce a new CSV file that contains exactly the same records, but fully sorted by one specified column. Design the algorithm and the system around it.

Constraints & Assumptions

Input is one CSV file, about 500 GB, on local disk; the machine has roughly 16 GB of usable RAM.
There is ample scratch disk (assume at least 2x the input size free).
The sort key is a single column; it may be numeric or textual (confirm which).
The file has a header row; fields may be quoted and may contain embedded commas or newlines.
Output is a single CSV with the same schema, header preserved at the top.
Baseline target is a single commodity machine; a cluster may be available as an extension.

Clarifying Questions to Ask

Is the comparison numeric, lexicographic, or locale/collation-specific? Ascending or descending? Does the sort need to be stable?
Roughly how many rows, and how wide is an average row? (This sets record count and per-row memory.)
Can fields contain commas or newlines inside quotes, so we need a real CSV parser, or is it a simple delimited format?
Is this a one-off sort, or will the data be sorted/queried repeatedly (which would favor a database or an index)?
Single machine only, or is a distributed engine (Spark, MapReduce) available?
Must the output be a single file, or may it be split into ordered part files?
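
Two of these questions change the code directly: whether a real CSV parser is needed, and whether the key compares as text or as a number. The sketch below assumes Python's standard `csv` module and a hypothetical helper `make_key`; the sample data is invented for illustration.

```python
import csv
import io

def make_key(column_index, numeric):
    """Return a function that maps a parsed row to its sort key (hypothetical helper)."""
    if numeric:
        return lambda row: float(row[column_index])
    return lambda row: row[column_index]

# Invented sample: one field has an embedded comma, another an embedded newline.
sample = 'id,note\n10,"hello, world"\n9,"two\nlines"\n'

# newline="" lets the csv module see the raw line endings inside quoted fields.
reader = csv.reader(io.StringIO(sample, newline=""))
header = next(reader)
rows = list(reader)

by_text = sorted(rows, key=make_key(0, numeric=False))   # "10" sorts before "9"
by_number = sorted(rows, key=make_key(0, numeric=True))  # 9 sorts before 10
```

Splitting on raw newlines would have cut the second record in half, which is why the quoting question matters before any chunking logic is written.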

Part 1 — Core external sort

Design the core algorithm that turns the unsorted 500 GB file into a sorted one given only 16 GB of RAM.
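
One common way to answer this is a two-phase external merge sort: read bounded chunks, sort each in memory and spill it as a sorted run, then stream a k-way merge over the runs. The following is a minimal sketch under stated assumptions, not a tuned implementation: the key compares as text, `rows_per_run` stands in for a real memory budget in bytes, and `external_sort` and the run file names are hypothetical.

```python
import csv
import heapq
import itertools
import os
import tempfile

def external_sort(src, dst, col, rows_per_run, workdir):
    """Sort the CSV at `src` by column index `col` and write the result to `dst`."""
    run_paths = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumes the file has a header row
        # Phase 1: read a bounded chunk, sort it in memory, spill it as a run.
        while True:
            chunk = list(itertools.islice(reader, rows_per_run))
            if not chunk:
                break
            chunk.sort(key=lambda r: r[col])  # list.sort is stable
            path = os.path.join(workdir, f"run_{len(run_paths):05d}.csv")
            with open(path, "w", newline="") as out:
                csv.writer(out).writerows(chunk)
            run_paths.append(path)

    # Phase 2: k-way merge. heapq.merge pulls rows lazily from each run,
    # so memory use depends on the number of runs, not on the data size.
    files = [open(p, newline="") for p in run_paths]
    try:
        readers = [csv.reader(fh) for fh in files]
        with open(dst, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(heapq.merge(*readers, key=lambda r: r[col]))
    finally:
        for fh in files:
            fh.close()

# Tiny demonstration with invented data and two rows per run.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.csv")
    dst = os.path.join(tmp, "out.csv")
    with open(src, "w", newline="") as f:
        csv.writer(f).writerows([
            ["name", "city"],
            ["d", "Oslo"], ["a", "Rome, IT"], ["c", "Lima"],
            ["b", "Kyiv"], ["a", "Bonn"],
        ])
    external_sort(src, dst, col=0, rows_per_run=2, workdir=tmp)
    with open(dst, newline="") as f:
        result = list(csv.reader(f))
```

Because each run is sorted stably and `heapq.merge` prefers earlier inputs on ties, the two rows with key `a` keep their original relative order in this sketch.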

Part 2 — Sizing, I/O, and parallelism

How do you choose the chunk size and the merge fan-in, and where is the bottleneck? What happens if there are more runs than you can merge at once?
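
The sizing questions reduce to a little arithmetic: run count is input size divided by run size, and the number of merge passes grows with the logarithm of the run count in base fan-in. The sketch below assumes 8 GB runs (leaving headroom, since parsed rows usually take more memory than their on-disk bytes) and two illustrative fan-in values; `plan_sort` is a hypothetical helper and the I/O figure ignores the header and filesystem overhead.

```python
import math

def plan_sort(input_gb, run_gb, fan_in):
    """Estimate run count, merge passes, and total disk traffic in GB."""
    runs = math.ceil(input_gb / run_gb)
    passes = 0
    remaining = runs
    while remaining > 1:
        # Each pass merges up to `fan_in` runs into one longer run.
        remaining = math.ceil(remaining / fan_in)
        passes += 1
    # Run generation reads and writes the data once; so does each merge pass.
    total_io_gb = 2 * input_gb * (1 + passes)
    return runs, passes, total_io_gb

single = plan_sort(500, 8, fan_in=64)  # all runs fit in one merge pass
multi = plan_sort(500, 8, fan_in=8)    # too many runs: an extra pass is needed
```

Under these assumptions the job is dominated by disk traffic rather than comparisons, and every additional merge pass adds another full read and write of the data, which is the usual argument for the largest fan-in the buffer budget allows.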

Part 3 — Alternatives and reliability

When would you instead load the data into a database or a distributed engine rather than hand-rolling a sort? And how do you make a multi-hour job restartable if it dies partway through?
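
For the restartability half of this question, one possible approach is to commit each run atomically and record it in a manifest, so a restarted job skips work that already finished. This is a sketch under assumptions: chunk boundaries are deterministic across attempts, chunks are passed in as already-sorted CSV text for brevity, and `commit_file`, `load_manifest`, `produce_runs`, and `manifest.json` are hypothetical names.

```python
import json
import os
import tempfile

def commit_file(path, data):
    """Write `data` to `path` so a crash never leaves a partial file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w", newline="") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename when both paths share a filesystem

def load_manifest(path):
    """Return the recorded progress, or an empty record on a fresh start."""
    if not os.path.exists(path):
        return {"done_runs": []}
    with open(path) as f:
        return json.load(f)

def produce_runs(chunks, workdir):
    """Spill each chunk as a run, skipping runs a previous attempt finished."""
    manifest_path = os.path.join(workdir, "manifest.json")
    manifest = load_manifest(manifest_path)
    written = []
    for i, chunk in enumerate(chunks):
        name = f"run_{i:05d}.csv"
        if name in manifest["done_runs"]:
            continue
        commit_file(os.path.join(workdir, name), chunk)
        manifest["done_runs"].append(name)
        commit_file(manifest_path, json.dumps(manifest))
        written.append(name)
    return written

with tempfile.TemporaryDirectory() as tmp:
    first = produce_runs(["a\n", "b\n"], tmp)          # attempt that stops after two runs
    second = produce_runs(["a\n", "b\n", "c\n"], tmp)  # restart redoes only the third
```

The same pattern extends to the merge phase: write the final output under a temporary name and rename it only once the merge completes, so a consumer never sees a half-written result.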

Follow-up Questions

If the file grows to 50 TB and you have a 100-node cluster, how does the design change, and how do you avoid skew when one key value dominates?
How would you efficiently and repeatedly serve pagination requests such as "rows 1,000,000 to 1,000,050 in sorted order"?
How do you make the sort stable, and why might stability matter here?
Suppose you must also deduplicate rows that share the same key. How do you fold that into the merge without a second pass?
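
The deduplication follow-up can be folded into the merge because equal keys arrive adjacent to each other in the merged stream. The sketch below assumes a "keep the first row per key" policy, which the question does not specify; `merge_dedup` and the sample runs are hypothetical.

```python
import heapq
import itertools

def merge_dedup(runs, key):
    """Merge sorted runs and yield only the first row seen for each key."""
    merged = heapq.merge(*runs, key=key)
    # Equal keys are adjacent after the merge, so groupby sees each key once.
    for _, group in itertools.groupby(merged, key=key):
        yield next(group)

# Invented in-memory runs; real runs would be streamed from disk.
run_a = [("alice", 1), ("carol", 2)]
run_b = [("alice", 3), ("bob", 4)]
unique = list(merge_dedup([run_a, run_b], key=lambda r: r[0]))
```

This also touches the stability question: `heapq.merge` prefers earlier inputs on ties, so if runs are numbered in input order, "first" here means the row that appeared earliest in the original file.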
