How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Technical Screen rounds at xAI.

What role is this question designed for?

This question is commonly asked for Data Engineer candidates at xAI during technical interviews.

Build a One-Pass Data Cleaning Pipeline

Q: Build a One-Pass Data Cleaning Pipeline

This question evaluates a candidate's competence in large-scale data engineering and system design, covering memory-bounded streaming IO, explainable per-document filtering, deduplication strategies, language detection, and metrics-driven data quality for language-model pretraining, and falls under the System Design / Data Engineering domain.

You are given one or more large Parquet files containing millions of publicly available, web-scraped documents. The documents will later be tokenized and used to pretrain a language model, and that model will be evaluated on a standard benchmark suite. Your cleaning and filtering decisions directly determine downstream model quality.

Design and implement a data cleaning pipeline as a single Python script that ingests the Parquet input, removes low-quality data in one pass without loading the whole corpus into memory, and writes a cleaned Parquet dataset suitable for LM training. The pipeline must:

Read the Parquet input efficiently and process it in a single streaming pass.
Detect and filter the common web-scrape failure modes — empty / too-short / too-long documents, boilerplate- and navigation-heavy pages, malformed or mojibake text, non-target-language content, spam / SEO pages, repeated-token gibberish, code-only pages, and exact or near-duplicate documents.
Make the cleaning logic explainable and configurable : thresholds are parameters, and every dropped document is attributable to a specific named filter.
Emit logging / metrics showing how many documents each filter removed.
Output valid Parquet the hiring team can tokenize and train on directly.

This is a ~48-hour take-home followed by a short (~15 min) live walkthrough on your design choices, trade-offs, and how you would validate that the cleaned corpus actually improves model quality.

Constraints & Assumptions

Scale: millions of documents; total size can exceed available RAM (assume tens of millions of docs / hundreds of GB is plausible). Peak memory must be independent of corpus size.
One pass, single machine, single script: read the input once, stream row groups, write accepted rows incrementally. The deliverable is one Python file — but you should be able to explain how it scales out.
Schema: one document per row, with at least a text column. Other fields ( doc_id , url , title , language , crawl timestamp ) are optional and may be missing or null.
Target language is configurable (assume English unless told otherwise).
Determinism & reproducibility: same input + same config produces the same output; the config (thresholds, version) is recorded with the output for audit.
Latency budget: offline/batch — throughput and memory matter; single-document latency does not. The hiring team controls tokenization, training, and evaluation downstream.

Clarifying Questions to Ask

What is the total corpus size (row count and on-disk bytes), and how many input files / row groups? This decides whether an in-memory hash set for exact dedup is viable or needs a disk-backed store.
What is the target language (or languages) , and should code, math, tables, and logs be kept or dropped for this model?
Is near-duplicate removal required for this round, or is exact dedup sufficient with near-dup discussed as future work? How important is filter recall vs. precision?
Is there a PII / safety requirement (remove vs. redact), or is the corpus already cleared as publicly available?
What does the output schema need to include — cleaned text only, or also provenance ( doc_id / url / cleaning version / hash)?
Which upstream fields can I trust (e.g. an existing language label), versus signals I should recompute from text ? What are the hardware limits (RAM, cores, runtime) the script must respect?

What a Strong Answer Covers

Streaming, memory-bounded IO: batched Parquet read + incremental write, peak memory stated and independent of corpus size; only needed columns materialized.
A taxonomy of web-corpus failure modes mapped to concrete, named filters — not a vague "remove bad data."
Conservative text normalization: Unicode normalization, mojibake repair, whitespace/control-char handling — without destroying linguistic signal (casing and meaningful structure preserved).
Explainable filters , each returning a decision and a reason: length bounds, language, boilerplate/navigation, spam/SEO, repeated-token/low-unique-ratio, malformed/low-information, code-only, optional PII.
Exact vs. near deduplication treated as distinct problems, with the right data structures and an honest account of the one-pass / memory limits.
Configurability & observability: thresholds and toggles via CLI/config; nothing hard-coded silently; per-filter drop counts plus aggregates (length distribution, dup rate, throughput).
A validation plan that ties cleaning to model quality: data-level inspection, small-scale training, ablations comparing raw vs. cleaned at equal token budget.
Trade-off awareness: precision vs. recall, one-pass vs. global statistics, single-machine vs. distributed.

Follow-up Questions

Your exact-dedup hash set grows with the number of kept documents. At 10× the data it no longer fits in RAM — how do you keep one-pass dedup memory-bounded (disk-backed KV store, Bloom filter, hash-prefix sharding)? What correctness do you trade away?
A purely one-pass script cannot compute global statistics — corpus-wide duplication rate, domain balance, length percentiles — before deciding what to keep. Which of your thresholds secretly need a global view, and how would you restructure into a two-pass or distributed (Spark/Ray/Beam) job without abandoning streaming?
Aggressive boilerplate/spam filtering shows a 2% drop on a downstream benchmark . How do you diagnose whether a filter is removing genuinely useful documents, and how would you tune or replace the heuristics with a learned quality classifier (and where would the training labels come from)?
How do you make the pipeline reproducible and auditable across reruns and config changes (versioned configs, deterministic hashing, a rejected-sample sidecar for inspection)?

Read the Parquet input efficiently and process it in a single streaming pass.
Detect and filter the common web-scrape failure modes — empty / too-short / too-long documents, boilerplate- and navigation-heavy pages, malformed or mojibake text, non-target-language content, spam / SEO pages, repeated-token gibberish, code-only pages, and exact or near-duplicate documents.
Make the cleaning logic explainable and configurable : thresholds are parameters, and every dropped document is attributable to a specific named filter.
Emit logging / metrics showing how many documents each filter removed.
Output valid Parquet the hiring team can tokenize and train on directly.

This is a ~48-hour take-home followed by a short (~15 min) live walkthrough on your design choices, trade-offs, and how you would validate that the cleaned corpus actually improves model quality.

Constraints & Assumptions

Scale: millions of documents; total size can exceed available RAM (assume tens of millions of docs / hundreds of GB is plausible). Peak memory must be independent of corpus size.
One pass, single machine, single script: read the input once, stream row groups, write accepted rows incrementally. The deliverable is one Python file — but you should be able to explain how it scales out.
Schema: one document per row, with at least a text column. Other fields ( doc_id , url , title , language , crawl timestamp ) are optional and may be missing or null.
Target language is configurable (assume English unless told otherwise).
Determinism & reproducibility: same input + same config produces the same output; the config (thresholds, version) is recorded with the output for audit.
Latency budget: offline/batch — throughput and memory matter; single-document latency does not. The hiring team controls tokenization, training, and evaluation downstream.

Clarifying Questions to Ask

What is the total corpus size (row count and on-disk bytes), and how many input files / row groups? This decides whether an in-memory hash set for exact dedup is viable or needs a disk-backed store.
What is the target language (or languages) , and should code, math, tables, and logs be kept or dropped for this model?
Is near-duplicate removal required for this round, or is exact dedup sufficient with near-dup discussed as future work? How important is filter recall vs. precision?
Is there a PII / safety requirement (remove vs. redact), or is the corpus already cleared as publicly available?
What does the output schema need to include — cleaned text only, or also provenance ( doc_id / url / cleaning version / hash)?
Which upstream fields can I trust (e.g. an existing language label), versus signals I should recompute from text ? What are the hardware limits (RAM, cores, runtime) the script must respect?

What a Strong Answer Covers

Streaming, memory-bounded IO: batched Parquet read + incremental write, peak memory stated and independent of corpus size; only needed columns materialized.
A taxonomy of web-corpus failure modes mapped to concrete, named filters — not a vague "remove bad data."
Conservative text normalization: Unicode normalization, mojibake repair, whitespace/control-char handling — without destroying linguistic signal (casing and meaningful structure preserved).
Explainable filters , each returning a decision and a reason: length bounds, language, boilerplate/navigation, spam/SEO, repeated-token/low-unique-ratio, malformed/low-information, code-only, optional PII.
Exact vs. near deduplication treated as distinct problems, with the right data structures and an honest account of the one-pass / memory limits.
Configurability & observability: thresholds and toggles via CLI/config; nothing hard-coded silently; per-filter drop counts plus aggregates (length distribution, dup rate, throughput).
A validation plan that ties cleaning to model quality: data-level inspection, small-scale training, ablations comparing raw vs. cleaned at equal token budget.
Trade-off awareness: precision vs. recall, one-pass vs. global statistics, single-machine vs. distributed.

Follow-up Questions

Your exact-dedup hash set grows with the number of kept documents. At 10× the data it no longer fits in RAM — how do you keep one-pass dedup memory-bounded (disk-backed KV store, Bloom filter, hash-prefix sharding)? What correctness do you trade away?
A purely one-pass script cannot compute global statistics — corpus-wide duplication rate, domain balance, length percentiles — before deciding what to keep. Which of your thresholds secretly need a global view, and how would you restructure into a two-pass or distributed (Spark/Ray/Beam) job without abandoning streaming?
Aggressive boilerplate/spam filtering shows a 2% drop on a downstream benchmark . How do you diagnose whether a filter is removing genuinely useful documents, and how would you tune or replace the heuristics with a learned quality classifier (and where would the training labels come from)?
How do you make the pipeline reproducible and auditable across reruns and config changes (versioned configs, deterministic hashing, a rejected-sample sidecar for inspection)?

Build a One-Pass Data Cleaning Pipeline

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP

Build a One-Pass Data Cleaning Pipeline

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP