PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/System Design/xAI

Build a One-Pass Data Cleaning Pipeline

Last updated: Jun 17, 2026

Quick Overview

This question evaluates a candidate's competence in large-scale data engineering and system design, covering memory-bounded streaming IO, explainable per-document filtering, deduplication strategies, language detection, and metrics-driven data quality for language-model pretraining, and falls under the System Design / Data Engineering domain.

  • medium
  • xAI
  • System Design
  • Data Engineer

Build a One-Pass Data Cleaning Pipeline

Company: xAI

Role: Data Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

You are given one or more large Parquet files containing **millions of publicly available, web-scraped documents**. The documents will later be tokenized and used to **pretrain a language model**, and that model will be evaluated on a standard benchmark suite. Your cleaning and filtering decisions directly determine downstream model quality. Design and implement a **data cleaning pipeline as a single Python script** that ingests the Parquet input, removes low-quality data in **one pass** without loading the whole corpus into memory, and writes a cleaned Parquet dataset suitable for LM training. The pipeline must: - Read the Parquet input efficiently and process it in a single streaming pass. - Detect and filter the common web-scrape failure modes — empty / too-short / too-long documents, boilerplate- and navigation-heavy pages, malformed or mojibake text, non-target-language content, spam / SEO pages, repeated-token gibberish, code-only pages, and exact or near-duplicate documents. - Make the cleaning logic **explainable and configurable**: thresholds are parameters, and every dropped document is attributable to a specific named filter. - Emit **logging / metrics** showing how many documents each filter removed. - Output valid Parquet the hiring team can tokenize and train on directly. This is a ~48-hour take-home followed by a short (~15 min) live walkthrough on your design choices, trade-offs, and how you would validate that the cleaned corpus actually improves model quality. ```hint Where to start Separate the three concerns this pipeline really has: moving bytes in and out (IO), judging a single document on its own (per-document filters), and reasoning about a document relative to others already seen (dedup). They have very different memory and testing profiles. The "configurable + every drop is attributable" requirement falls out naturally if each filter is a pure function returning `(keep, reason)`. ``` ```hint Memory-bounded IO The binding constraint is that peak memory must not grow with the corpus. Read the Parquet input *incrementally* — `pyarrow.dataset.to_batches()` (row groups) or a `polars` lazy/streaming scan — and write output incrementally with a `ParquetWriter`. Then state your peak working set in one sentence: if it doesn't mention the corpus size, you're on the right track. ``` ```hint Order filters cheap-to-expensive Some signals are nearly free (length, character-class ratios); others are expensive (language ID, dedup index lookups). Run the cheap ones first and short-circuit, so expensive checks only touch survivors. Note one ordering trap: language detection is unreliable on very short text, so it must run *after* the minimum-length filter. ``` ```hint Deduplication is the part with real depth "Same document" means two different things needing different machinery. **Exact** dedup = hash the normalized text into a `set`/disk-backed store (single pass, O(1) membership). **Near** dedup = MinHash + LSH over token shingles (Jaccard) or SimHash + Hamming distance. The hard part under a one-pass, memory-limited constraint is that the near-dup index grows with kept docs — say honestly how you bound it and what you defer to an offline/distributed job. ``` ```hint Don't over-filter The objective is *model quality*, not maximal removal. Aggressive filters discard rare-but-valuable data (technical text, math, code, multilingual). Prefer ratios over absolute counts, make every threshold a config knob, and design so you can **ablate** filters to measure each one's effect on the trained model. ``` ### Constraints & Assumptions - **Scale:** millions of documents; total size can exceed available RAM (assume tens of millions of docs / hundreds of GB is plausible). Peak memory must be independent of corpus size. - **One pass, single machine, single script:** read the input once, stream row groups, write accepted rows incrementally. The deliverable is one Python file — but you should be able to explain how it scales out. - **Schema:** one document per row, with at least a `text` column. Other fields (`doc_id`, `url`, `title`, `language`, crawl `timestamp`) are optional and may be missing or null. - **Target language** is configurable (assume English unless told otherwise). - **Determinism & reproducibility:** same input + same config produces the same output; the config (thresholds, version) is recorded with the output for audit. - **Latency budget:** offline/batch — throughput and memory matter; single-document latency does not. The hiring team controls tokenization, training, and evaluation downstream. ### Clarifying Questions to Ask - What is the **total corpus size** (row count and on-disk bytes), and how many input files / row groups? This decides whether an in-memory hash set for exact dedup is viable or needs a disk-backed store. - What is the **target language (or languages)**, and should code, math, tables, and logs be kept or dropped for this model? - Is **near-duplicate** removal required for this round, or is exact dedup sufficient with near-dup discussed as future work? How important is filter recall vs. precision? - Is there a **PII / safety** requirement (remove vs. redact), or is the corpus already cleared as publicly available? - What does the **output schema** need to include — cleaned text only, or also provenance (`doc_id` / `url` / cleaning version / hash)? - Which **upstream fields can I trust** (e.g. an existing `language` label), versus signals I should recompute from `text`? What are the hardware limits (RAM, cores, runtime) the script must respect? ### What a Strong Answer Covers - **Streaming, memory-bounded IO:** batched Parquet read + incremental write, peak memory stated and independent of corpus size; only needed columns materialized. - **A taxonomy of web-corpus failure modes** mapped to concrete, *named* filters — not a vague "remove bad data." - **Conservative text normalization:** Unicode normalization, mojibake repair, whitespace/control-char handling — *without* destroying linguistic signal (casing and meaningful structure preserved). - **Explainable filters**, each returning a decision *and* a reason: length bounds, language, boilerplate/navigation, spam/SEO, repeated-token/low-unique-ratio, malformed/low-information, code-only, optional PII. - **Exact vs. near deduplication** treated as distinct problems, with the right data structures and an honest account of the one-pass / memory limits. - **Configurability & observability:** thresholds and toggles via CLI/config; nothing hard-coded silently; per-filter drop counts plus aggregates (length distribution, dup rate, throughput). - **A validation plan that ties cleaning to model quality:** data-level inspection, small-scale training, ablations comparing raw vs. cleaned at equal token budget. - **Trade-off awareness:** precision vs. recall, one-pass vs. global statistics, single-machine vs. distributed. ### Follow-up Questions - Your exact-dedup hash set grows with the number of *kept* documents. At 10× the data it no longer fits in RAM — how do you keep one-pass dedup memory-bounded (disk-backed KV store, Bloom filter, hash-prefix sharding)? What correctness do you trade away? - A purely one-pass script cannot compute **global** statistics — corpus-wide duplication rate, domain balance, length percentiles — before deciding what to keep. Which of your thresholds secretly need a global view, and how would you restructure into a two-pass or distributed (Spark/Ray/Beam) job without abandoning streaming? - Aggressive boilerplate/spam filtering shows a **2% drop on a downstream benchmark**. How do you diagnose whether a filter is removing genuinely useful documents, and how would you tune or replace the heuristics with a **learned quality classifier** (and where would the training labels come from)? - How do you make the pipeline **reproducible and auditable** across reruns and config changes (versioned configs, deterministic hashing, a rejected-sample sidecar for inspection)?

Quick Answer: This question evaluates a candidate's competence in large-scale data engineering and system design, covering memory-bounded streaming IO, explainable per-document filtering, deduplication strategies, language detection, and metrics-driven data quality for language-model pretraining, and falls under the System Design / Data Engineering domain.

Related Interview Questions

  • Design a multi-level API rate limiter - xAI (easy)
  • Design a backend for an online checkers game - xAI (medium)
  • Design a follower push-notification system - xAI (hard)
  • Design a schema for server engagement - xAI (medium)
  • Design backend to score and classify tweets - xAI (medium)
xAI logo
xAI
May 30, 2026, 12:00 AM
Data Engineer
Technical Screen
System Design
2
0

You are given one or more large Parquet files containing millions of publicly available, web-scraped documents. The documents will later be tokenized and used to pretrain a language model, and that model will be evaluated on a standard benchmark suite. Your cleaning and filtering decisions directly determine downstream model quality.

Design and implement a data cleaning pipeline as a single Python script that ingests the Parquet input, removes low-quality data in one pass without loading the whole corpus into memory, and writes a cleaned Parquet dataset suitable for LM training. The pipeline must:

  • Read the Parquet input efficiently and process it in a single streaming pass.
  • Detect and filter the common web-scrape failure modes — empty / too-short / too-long documents, boilerplate- and navigation-heavy pages, malformed or mojibake text, non-target-language content, spam / SEO pages, repeated-token gibberish, code-only pages, and exact or near-duplicate documents.
  • Make the cleaning logic explainable and configurable : thresholds are parameters, and every dropped document is attributable to a specific named filter.
  • Emit logging / metrics showing how many documents each filter removed.
  • Output valid Parquet the hiring team can tokenize and train on directly.

This is a ~48-hour take-home followed by a short (~15 min) live walkthrough on your design choices, trade-offs, and how you would validate that the cleaned corpus actually improves model quality.

Constraints & Assumptions

  • Scale: millions of documents; total size can exceed available RAM (assume tens of millions of docs / hundreds of GB is plausible). Peak memory must be independent of corpus size.
  • One pass, single machine, single script: read the input once, stream row groups, write accepted rows incrementally. The deliverable is one Python file — but you should be able to explain how it scales out.
  • Schema: one document per row, with at least a text column. Other fields ( doc_id , url , title , language , crawl timestamp ) are optional and may be missing or null.
  • Target language is configurable (assume English unless told otherwise).
  • Determinism & reproducibility: same input + same config produces the same output; the config (thresholds, version) is recorded with the output for audit.
  • Latency budget: offline/batch — throughput and memory matter; single-document latency does not. The hiring team controls tokenization, training, and evaluation downstream.

Clarifying Questions to Ask

  • What is the total corpus size (row count and on-disk bytes), and how many input files / row groups? This decides whether an in-memory hash set for exact dedup is viable or needs a disk-backed store.
  • What is the target language (or languages) , and should code, math, tables, and logs be kept or dropped for this model?
  • Is near-duplicate removal required for this round, or is exact dedup sufficient with near-dup discussed as future work? How important is filter recall vs. precision?
  • Is there a PII / safety requirement (remove vs. redact), or is the corpus already cleared as publicly available?
  • What does the output schema need to include — cleaned text only, or also provenance ( doc_id / url / cleaning version / hash)?
  • Which upstream fields can I trust (e.g. an existing language label), versus signals I should recompute from text ? What are the hardware limits (RAM, cores, runtime) the script must respect?

What a Strong Answer Covers

  • Streaming, memory-bounded IO: batched Parquet read + incremental write, peak memory stated and independent of corpus size; only needed columns materialized.
  • A taxonomy of web-corpus failure modes mapped to concrete, named filters — not a vague "remove bad data."
  • Conservative text normalization: Unicode normalization, mojibake repair, whitespace/control-char handling — without destroying linguistic signal (casing and meaningful structure preserved).
  • Explainable filters , each returning a decision and a reason: length bounds, language, boilerplate/navigation, spam/SEO, repeated-token/low-unique-ratio, malformed/low-information, code-only, optional PII.
  • Exact vs. near deduplication treated as distinct problems, with the right data structures and an honest account of the one-pass / memory limits.
  • Configurability & observability: thresholds and toggles via CLI/config; nothing hard-coded silently; per-filter drop counts plus aggregates (length distribution, dup rate, throughput).
  • A validation plan that ties cleaning to model quality: data-level inspection, small-scale training, ablations comparing raw vs. cleaned at equal token budget.
  • Trade-off awareness: precision vs. recall, one-pass vs. global statistics, single-machine vs. distributed.

Follow-up Questions

  • Your exact-dedup hash set grows with the number of kept documents. At 10× the data it no longer fits in RAM — how do you keep one-pass dedup memory-bounded (disk-backed KV store, Bloom filter, hash-prefix sharding)? What correctness do you trade away?
  • A purely one-pass script cannot compute global statistics — corpus-wide duplication rate, domain balance, length percentiles — before deciding what to keep. Which of your thresholds secretly need a global view, and how would you restructure into a two-pass or distributed (Spark/Ray/Beam) job without abandoning streaming?
  • Aggressive boilerplate/spam filtering shows a 2% drop on a downstream benchmark . How do you diagnose whether a filter is removing genuinely useful documents, and how would you tune or replace the heuristics with a learned quality classifier (and where would the training labels come from)?
  • How do you make the pipeline reproducible and auditable across reruns and config changes (versioned configs, deterministic hashing, a rejected-sample sidecar for inspection)?

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More xAI•More Data Engineer•xAI Data Engineer•xAI System Design•Data Engineer System Design
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.