System Design: Pack Text Lines into Exact 100 MB Output Files
Design a data pipeline that reads many text files of varying sizes and emits output files of exactly 100 MB each. The pipeline must:
- Preserve line boundaries (never split a line across files).
- Ensure every input line appears exactly once in the output (no duplicates, no omissions).
- Support high parallelism.
- Handle buffering and partial lines at boundaries.
- Address compression choices and their impact on file sizing.
- Provide fault tolerance with exactly-once output semantics.
Provide a detailed design that specifies:
- How input files are read and split across workers, handling partial lines at split boundaries.
- How lines are grouped into exactly 100 MB output files while preserving boundaries.
- Buffering strategies for efficient I/O.
- Compression options and their implications for the "exactly 100 MB" requirement.
- Parallelization strategy and scaling behavior.
- Fault tolerance and exactly-once output (no duplicates or omissions) under retries.
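As a starting point for the split-boundary item, one common convention (an assumption here, not something the problem statement prescribes) is that a worker assigned the byte range [start, end) owns every line whose first byte falls inside that range. It skips a leading partial line, which belongs to the previous split, and reads past `end` to finish the last line it owns. The function name `read_split` and the file-like argument are hypothetical. Splitting on the LF byte is safe for UTF-8 because 0x0A never occurs inside a multi-byte sequence, and a CR before the LF simply stays attached to its line.

```python
import io
from typing import BinaryIO, Iterator


def read_split(f: BinaryIO, start: int, end: int) -> Iterator[bytes]:
    """Yield each line (terminator included) whose first byte is in [start, end).

    Sketch only: `f` is any seekable binary stream; real input would be a
    file or object-store range reader.
    """
    if start > 0:
        # Look at the byte just before the split. If it is not LF, the split
        # begins mid-line, and that line belongs to the previous split.
        f.seek(start - 1)
        if f.read(1) != b"\n":
            f.readline()  # discard the remainder of the partial line
    else:
        f.seek(0)

    # A line is owned if it starts before `end`, even if it finishes after it.
    while f.tell() < end:
        line = f.readline()
        if not line:  # end of input
            break
        yield line


if __name__ == "__main__":
    data = b"alpha\r\nbeta\ngamma\n\ndelta"
    split_size = 7  # arbitrary; boundaries land in the middle of lines
    for s in range(0, len(data), split_size):
        e = min(s + split_size, len(data))
        print((s, e), list(read_split(io.BytesIO(data), s, e)))
```

Because ownership is decided purely by where a line starts, any choice of split boundaries yields each line exactly once across workers, with no coordination between them.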
Assume line-delimited UTF-8 text input (LF or CRLF). If you need further assumptions, keep them minimal and state them explicitly.
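To make the grouping requirement concrete, here is a minimal sketch of greedy packing that never splits a line. It treats the target size as a cap: whole lines rarely sum to exactly the target, so how to close the gap (padding, selecting lines to fit, or allowing a short final file) is a design decision the answer should state. The name `pack_lines`, the `target` parameter, and the choice to reject lines longer than the target are all assumptions for illustration, as is whether "100 MB" means 100 × 10^6 or 100 × 2^20 bytes.

```python
from typing import Iterable, Iterator, List


def pack_lines(lines: Iterable[bytes], target: int) -> Iterator[List[bytes]]:
    """Group whole lines into batches of at most `target` bytes, in order."""
    batch: List[bytes] = []
    size = 0
    for line in lines:
        if len(line) > target:
            # Assumption: a single line can never exceed one output file.
            raise ValueError("line is longer than the target file size")
        if size + len(line) > target:
            # The next line would overflow; close the current batch first.
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += len(line)
    if batch:
        yield batch  # final batch may be well short of the target


if __name__ == "__main__":
    sample = [b"aaaa\n", b"bb\n", b"cccccc\n", b"d\n"]
    for group in pack_lines(sample, target=8):
        total = sum(len(x) for x in group)
        # The shortfall is what an "exactly N bytes" design must account for.
        print(total, 8 - total, group)
```

The packer preserves input order and emits every line once, so the exactly-once property then depends only on how batches are committed to storage under retries.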