Build a One-Pass Data Cleaning Pipeline
Company: xAI
Role: Data Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
You are given one or more large Parquet files containing **millions of publicly available, web-scraped documents**. The documents will later be tokenized and used to **pretrain a language model**, and that model will be evaluated on a standard benchmark suite. Your cleaning and filtering decisions directly determine downstream model quality.
Design and implement a **data cleaning pipeline as a single Python script** that ingests the Parquet input, removes low-quality data in **one pass** without loading the whole corpus into memory, and writes a cleaned Parquet dataset suitable for LM training. The pipeline must:
- Read the Parquet input efficiently and process it in a single streaming pass.
- Detect and filter the common web-scrape failure modes — empty / too-short / too-long documents, boilerplate- and navigation-heavy pages, malformed or mojibake text, non-target-language content, spam / SEO pages, repeated-token gibberish, code-only pages, and exact or near-duplicate documents.
- Make the cleaning logic **explainable and configurable**: thresholds are parameters, and every dropped document is attributable to a specific named filter.
- Emit **logging / metrics** showing how many documents each filter removed.
- Output valid Parquet the hiring team can tokenize and train on directly.
This is a ~48-hour take-home followed by a short (~15 min) live walkthrough on your design choices, trade-offs, and how you would validate that the cleaned corpus actually improves model quality.
```hint Where to start
Separate the three concerns this pipeline really has: moving bytes in and out (IO), judging a single document on its own (per-document filters), and reasoning about a document relative to others already seen (dedup). They have very different memory and testing profiles. The "configurable + every drop is attributable" requirement falls out naturally if each filter is a pure function returning `(keep, reason)`.
```
```hint Memory-bounded IO
The binding constraint is that peak memory must not grow with the corpus. Read the Parquet input *incrementally* — `pyarrow.dataset.to_batches()` (row groups) or a `polars` lazy/streaming scan — and write output incrementally with a `ParquetWriter`. Then state your peak working set in one sentence: if it doesn't mention the corpus size, you're on the right track.
```
```hint Order filters cheap-to-expensive
Some signals are nearly free (length, character-class ratios); others are expensive (language ID, dedup index lookups). Run the cheap ones first and short-circuit, so expensive checks only touch survivors. Note one ordering trap: language detection is unreliable on very short text, so it must run *after* the minimum-length filter.
```
```hint Deduplication is the part with real depth
"Same document" means two different things needing different machinery. **Exact** dedup = hash the normalized text into a `set`/disk-backed store (single pass, O(1) membership). **Near** dedup = MinHash + LSH over token shingles (Jaccard) or SimHash + Hamming distance. The hard part under a one-pass, memory-limited constraint is that the near-dup index grows with kept docs — say honestly how you bound it and what you defer to an offline/distributed job.
```
```hint Don't over-filter
The objective is *model quality*, not maximal removal. Aggressive filters discard rare-but-valuable data (technical text, math, code, multilingual). Prefer ratios over absolute counts, make every threshold a config knob, and design so you can **ablate** filters to measure each one's effect on the trained model.
```
### Constraints & Assumptions
- **Scale:** millions of documents; total size can exceed available RAM (assume tens of millions of docs / hundreds of GB is plausible). Peak memory must be independent of corpus size.
- **One pass, single machine, single script:** read the input once, stream row groups, write accepted rows incrementally. The deliverable is one Python file — but you should be able to explain how it scales out.
- **Schema:** one document per row, with at least a `text` column. Other fields (`doc_id`, `url`, `title`, `language`, crawl `timestamp`) are optional and may be missing or null.
- **Target language** is configurable (assume English unless told otherwise).
- **Determinism & reproducibility:** same input + same config produces the same output; the config (thresholds, version) is recorded with the output for audit.
- **Latency budget:** offline/batch — throughput and memory matter; single-document latency does not. The hiring team controls tokenization, training, and evaluation downstream.
### Clarifying Questions to Ask
- What is the **total corpus size** (row count and on-disk bytes), and how many input files / row groups? This decides whether an in-memory hash set for exact dedup is viable or needs a disk-backed store.
- What is the **target language (or languages)**, and should code, math, tables, and logs be kept or dropped for this model?
- Is **near-duplicate** removal required for this round, or is exact dedup sufficient with near-dup discussed as future work? How important is filter recall vs. precision?
- Is there a **PII / safety** requirement (remove vs. redact), or is the corpus already cleared as publicly available?
- What does the **output schema** need to include — cleaned text only, or also provenance (`doc_id` / `url` / cleaning version / hash)?
- Which **upstream fields can I trust** (e.g. an existing `language` label), versus signals I should recompute from `text`? What are the hardware limits (RAM, cores, runtime) the script must respect?
### What a Strong Answer Covers
- **Streaming, memory-bounded IO:** batched Parquet read + incremental write, peak memory stated and independent of corpus size; only needed columns materialized.
- **A taxonomy of web-corpus failure modes** mapped to concrete, *named* filters — not a vague "remove bad data."
- **Conservative text normalization:** Unicode normalization, mojibake repair, whitespace/control-char handling — *without* destroying linguistic signal (casing and meaningful structure preserved).
- **Explainable filters**, each returning a decision *and* a reason: length bounds, language, boilerplate/navigation, spam/SEO, repeated-token/low-unique-ratio, malformed/low-information, code-only, optional PII.
- **Exact vs. near deduplication** treated as distinct problems, with the right data structures and an honest account of the one-pass / memory limits.
- **Configurability & observability:** thresholds and toggles via CLI/config; nothing hard-coded silently; per-filter drop counts plus aggregates (length distribution, dup rate, throughput).
- **A validation plan that ties cleaning to model quality:** data-level inspection, small-scale training, ablations comparing raw vs. cleaned at equal token budget.
- **Trade-off awareness:** precision vs. recall, one-pass vs. global statistics, single-machine vs. distributed.
### Follow-up Questions
- Your exact-dedup hash set grows with the number of *kept* documents. At 10× the data it no longer fits in RAM — how do you keep one-pass dedup memory-bounded (disk-backed KV store, Bloom filter, hash-prefix sharding)? What correctness do you trade away?
- A purely one-pass script cannot compute **global** statistics — corpus-wide duplication rate, domain balance, length percentiles — before deciding what to keep. Which of your thresholds secretly need a global view, and how would you restructure into a two-pass or distributed (Spark/Ray/Beam) job without abandoning streaming?
- Aggressive boilerplate/spam filtering shows a **2% drop on a downstream benchmark**. How do you diagnose whether a filter is removing genuinely useful documents, and how would you tune or replace the heuristics with a **learned quality classifier** (and where would the training labels come from)?
- How do you make the pipeline **reproducible and auditable** across reruns and config changes (versioned configs, deterministic hashing, a rejected-sample sidecar for inspection)?
Quick Answer: This question evaluates a candidate's competence in large-scale data engineering and system design, covering memory-bounded streaming IO, explainable per-document filtering, deduplication strategies, language detection, and metrics-driven data quality for language-model pretraining, and falls under the System Design / Data Engineering domain.