PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/System Design/Microsoft

Externally Sort a 500 GB CSV by One Column with 16 GB of RAM

Last updated: Jul 1, 2026

Quick Overview

This question evaluates a candidate's ability to design an external merge sort for data far larger than available memory, a core systems design skill. It tests reasoning about chunking, k-way merging, I/O bottlenecks, and when a database or distributed engine is preferable to a hand-rolled solution. This is a practical systems design scenario commonly used to assess algorithmic and infrastructure trade-off thinking.

  • medium
  • Microsoft
  • System Design
  • Software Engineer

Externally Sort a 500 GB CSV by One Column with 16 GB of RAM

Company: Microsoft

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Onsite

You are handed a single very large CSV file (roughly 500 GB) sitting on the local disk of one machine. The machine has only about 16 GB of usable RAM. Produce a new CSV file that contains exactly the same records, but fully sorted by one specified column. Design the algorithm and the system around it. ### Constraints & Assumptions - Input is one CSV file, about 500 GB, on local disk; the machine has roughly 16 GB of usable RAM. - There is ample scratch disk (assume at least 2x the input size free). - The sort key is a single column; it may be numeric or textual (confirm which). - The file has a header row; fields may be quoted and may contain embedded commas or newlines. - Output is a single CSV with the same schema, header preserved at the top. - Baseline target is a single commodity machine; a cluster may be available as an extension. ### Clarifying Questions to Ask - Is the comparison numeric, lexicographic, or locale/collation-specific? Ascending or descending? Does the sort need to be stable? - Roughly how many rows, and how wide is an average row? (This sets record count and per-row memory.) - Can fields contain commas or newlines inside quotes, so we need a real CSV parser, or is it a simple delimited format? - Is this a one-off sort, or will the data be sorted/queried repeatedly (which would favor a database or an index)? - Single machine only, or is a distributed engine (Spark, MapReduce) available? - Must the output be a single file, or may it be split into ordered part files? ### Part 1 — Core external sort Design the core algorithm that turns the unsorted 500 GB file into a sorted one given only 16 GB of RAM. ```hint Where to start The data is far larger than RAM, so you cannot sort it in one pass. Sort it in memory-sized pieces, then merge the sorted pieces. This is external merge sort. ``` ```hint The merge step To merge many sorted runs into one output stream, repeatedly emit the smallest current row across all runs. A min-heap (priority queue) keyed on the sort column, holding one row per active run, does this in `O(log R)` per row. ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### Part 2 — Sizing, I/O, and parallelism How do you choose the chunk size and the merge fan-in, and where is the bottleneck? What happens if there are more runs than you can merge at once? ```hint Two competing limits Chunk size is bounded by RAM (minus parsing/object overhead). Merge fan-in is bounded by open file handles and the read buffer you can afford per run. When the number of runs exceeds the fan-in, you cannot merge in one pass. ``` ```hint When one merge pass is not enough With `R` runs and a fan-in of `f`, you need about `ceil(log_f R)` merge passes, each re-reading and re-writing the whole dataset. Pick `f` to keep it to a single (or very few) passes. ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### Part 3 — Alternatives and reliability When would you instead load the data into a database or a distributed engine rather than hand-rolling a sort? And how do you make a multi-hour job restartable if it dies partway through? ```hint Cost model A one-shot external sort touches the data roughly twice (write runs, then merge). A database pays an up-front indexing cost but amortizes it across many later queries. Match the tool to how often you will read the sorted data. ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### What a Strong Answer Covers ```premium-lock What a Strong Answer Covers ``` ### Follow-up Questions - If the file grows to 50 TB and you have a 100-node cluster, how does the design change, and how do you avoid skew when one key value dominates? - How would you efficiently and repeatedly serve pagination requests such as "rows 1,000,000 to 1,000,050 in sorted order"? - How do you make the sort stable, and why might stability matter here? - Suppose you must also deduplicate rows that share the same key. How do you fold that into the merge without a second pass?

Quick Answer: This question evaluates a candidate's ability to design an external merge sort for data far larger than available memory, a core systems design skill. It tests reasoning about chunking, k-way merging, I/O bottlenecks, and when a database or distributed engine is preferable to a hand-rolled solution. This is a practical systems design scenario commonly used to assess algorithmic and infrastructure trade-off thinking.

Related Interview Questions

  • Design a URL Shortener (High-Level and Low-Level Design) - Microsoft (medium)
  • Design a Distributed Key-Value Store - Microsoft (medium)
  • Design a Metrics Ingestion Pipeline - Microsoft (medium)
  • Design a To-Do List Service (CRUD, Auth, Rate Limiting, Caching & API Versioning) - Microsoft (medium)
  • Design A Scalable Web Crawler - Microsoft (medium)
|Home/System Design/Microsoft

Externally Sort a 500 GB CSV by One Column with 16 GB of RAM

Microsoft logo
Microsoft
Jun 11, 2026, 12:00 AM
mediumSoftware EngineerOnsiteSystem Design
1
0

You are handed a single very large CSV file (roughly 500 GB) sitting on the local disk of one machine. The machine has only about 16 GB of usable RAM. Produce a new CSV file that contains exactly the same records, but fully sorted by one specified column. Design the algorithm and the system around it.

Constraints & Assumptions

  • Input is one CSV file, about 500 GB, on local disk; the machine has roughly 16 GB of usable RAM.
  • There is ample scratch disk (assume at least 2x the input size free).
  • The sort key is a single column; it may be numeric or textual (confirm which).
  • The file has a header row; fields may be quoted and may contain embedded commas or newlines.
  • Output is a single CSV with the same schema, header preserved at the top.
  • Baseline target is a single commodity machine; a cluster may be available as an extension.

Clarifying Questions to Ask

  • Is the comparison numeric, lexicographic, or locale/collation-specific? Ascending or descending? Does the sort need to be stable?
  • Roughly how many rows, and how wide is an average row? (This sets record count and per-row memory.)
  • Can fields contain commas or newlines inside quotes, so we need a real CSV parser, or is it a simple delimited format?
  • Is this a one-off sort, or will the data be sorted/queried repeatedly (which would favor a database or an index)?
  • Single machine only, or is a distributed engine (Spark, MapReduce) available?
  • Must the output be a single file, or may it be split into ordered part files?

Part 1 — Core external sort

Design the core algorithm that turns the unsorted 500 GB file into a sorted one given only 16 GB of RAM.

What This Part Should Cover Premium

Part 2 — Sizing, I/O, and parallelism

How do you choose the chunk size and the merge fan-in, and where is the bottleneck? What happens if there are more runs than you can merge at once?

What This Part Should Cover Premium

Part 3 — Alternatives and reliability

When would you instead load the data into a database or a distributed engine rather than hand-rolling a sort? And how do you make a multi-hour job restartable if it dies partway through?

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

  • If the file grows to 50 TB and you have a 100-node cluster, how does the design change, and how do you avoid skew when one key value dominates?
  • How would you efficiently and repeatedly serve pagination requests such as "rows 1,000,000 to 1,000,050 in sorted order"?
  • How do you make the sort stable, and why might stability matter here?
  • Suppose you must also deduplicate rows that share the same key. How do you fold that into the merge without a second pass?

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Microsoft•More Software Engineer•Microsoft Software Engineer•Microsoft System Design•Software Engineer System Design

Your design canvas — auto-saved

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.