PracHub

Design line-preserving file chunker pipeline

Last updated: May 8, 2026

Quick Overview

This question evaluates expertise in designing distributed data pipelines with precise file chunking, line-boundary preservation, exactly-once semantics, buffering and partial-line handling, compression trade-offs, parallelization, and fault tolerance.


Company: Google

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

Design a data pipeline that reads many text files of varying sizes and emits output files of exactly 100 MB each, preserving line boundaries and ensuring every input line appears exactly once in the output. Specify how to handle partial lines at boundaries, buffering, compression, parallelism, and fault tolerance with exactly-once output without duplicates or omissions.


Sep 6, 2025, 12:00 AM

System Design: Pack Text Lines into Exact 100 MB Output Files

Design a data pipeline that reads many text files of varying sizes and emits output files of exactly 100 MB each. The pipeline must:

  • Preserve line boundaries (never split a line across files).
  • Ensure every input line appears exactly once in the output (no duplicates, no omissions).
  • Support high parallelism.
  • Handle buffering and partial lines at boundaries.
  • Address compression choices and their impact on file sizing.
  • Provide fault tolerance with exactly-once output semantics.
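The boundary-preservation requirement can be illustrated with a greedy packer. This is a minimal sketch, not a reference solution: the names `pack_lines` and `TARGET_BYTES` are hypothetical, "100 MB" is assumed to mean 100 MiB, and the target is treated as a hard cap. Whole lines rarely sum to the target exactly, so hitting "exactly 100 MB" would need a further assumption such as padding or a tolerance, which a candidate should state.

```python
from typing import Iterable, Iterator, List

# Assumption: "100 MB" is read as 100 MiB; the problem statement does not say.
TARGET_BYTES = 100 * 1024 * 1024


def pack_lines(lines: Iterable[bytes], target: int = TARGET_BYTES) -> Iterator[List[bytes]]:
    """Greedily group whole lines into chunks of at most `target` bytes."""
    chunk: List[bytes] = []
    size = 0
    for line in lines:
        if len(line) > target:
            # A line longer than one output file cannot be placed without splitting it.
            raise ValueError("a single line exceeds the target file size")
        if size + len(line) > target:
            # Close the current chunk before the line that would overflow it.
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield chunk  # final, possibly short, chunk
```

Each input line lands in exactly one chunk and chunk order follows input order, so concatenating the chunks reproduces the input byte for byte.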

Provide a detailed design that specifies:

  1. How input files are read and split across workers, handling partial lines at split boundaries.
  2. How lines are grouped into exactly 100 MB output files while preserving boundaries.
  3. Buffering strategies for efficient I/O.
  4. Compression options and their implications for the "exactly 100 MB" requirement.
  5. Parallelization strategy and scaling behavior.
  6. Fault tolerance and exactly-once output (no duplicates or omissions) under retries.
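For item 1, one common convention (assumed here, not prescribed by the question) is that a worker owns every line whose first byte falls inside its byte range. The sketch below uses a hypothetical helper `read_split`: a worker that does not start at offset 0 discards the partial line it lands in, and every worker reads past its end offset to finish the last line it owns.

```python
import io
from typing import BinaryIO, Iterator


def read_split(f: BinaryIO, start: int, end: int) -> Iterator[bytes]:
    """Yield every line whose first byte lies in [start, end)."""
    if start > 0:
        # Step back one byte so a line that begins exactly at `start` is kept:
        # if byte start-1 is a newline, the readline below consumes only that byte.
        f.seek(start - 1)
        f.readline()  # discard the tail of a line owned by the previous split
    else:
        f.seek(0)
    while f.tell() < end:
        line = f.readline()  # may read past `end` to complete the owned line
        if not line:
            break  # end of file
        yield line


# Usage: four splits with arbitrary offsets over one small in-memory "file".
data = b"alpha\nbeta\r\ngamma\ndelta"
splits = [(0, 4), (4, 12), (12, 20), (20, len(data))]
parts = [list(read_split(io.BytesIO(data), s, e)) for s, e in splits]
```

Because ownership is decided by where a line starts, the splits can be processed independently and in any order, and a retried split re-reads the same lines deterministically.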

Assume line-delimited UTF-8 text input (LF or CRLF line endings). If you need additional assumptions, keep them minimal and state them explicitly.
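One consequence of this assumption is that file sizes should be counted in encoded bytes, terminators included, not in characters. A small illustration, assuming lines are read in binary mode so that CRLF endings pass through unchanged:

```python
import io

raw = "naïve\r\nplain\n".encode("utf-8")

# Iterating a binary stream splits after b"\n" only, so "\r\n" stays attached
# to its line and the output can reproduce the input bytes exactly.
lines = list(io.BytesIO(raw))

# Sizes are in bytes: "ï" takes two bytes in UTF-8, and "\r\n" takes two more.
sizes = [len(line) for line in lines]
```

Here the first line has 5 characters of text but occupies 8 bytes, which is the number that counts toward the output file size.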

