PracHub

Debug ML pipeline and build text parser

Last updated: Jun 25, 2026

Quick Overview

This question evaluates practical machine learning engineering skills, including debugging unfamiliar codebases, parsing noisy text data, and identifying subtle defects such as train/test leakage or shape mismatches. It tests applied Python proficiency and systematic reasoning under time pressure, core competencies assessed in ML engineer technical interviews.

  • Medium
  • Scale AI
  • Data Manipulation (SQL/Python)
  • Machine Learning Engineer

Debug ML pipeline and build text parser

Company: Scale AI

Role: Machine Learning Engineer

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Technical Screen

- Given raw text files with noisy formatting, implement a robust parser that outputs structured examples; handle delimiters, quoting/escaping, encodings/Unicode, missing fields, and malformed lines, and describe how you would test it.
- In a provided ML project (data loading, preprocessing, training, evaluation), identify and fix three defects (e.g., index off-by-one in tokenization, train/test leakage, incorrect loss reduction, nondeterministic seeding, or shape mismatches). Explain your rapid debugging approach (stack traces, assertions, binary search logging, minimal repros).
- Describe how you would validate the fixes under a 60-minute time limit (unit tests, end-to-end run, metrics sanity checks, and regression guards).

Solution

This is a hands-on ML-engineering session, not a theory quiz. Three things are measured: whether you can write a *defensive* parser that survives real-world dirty data, whether you can *triage and fix bugs fast* in unfamiliar code, and whether you can *prove* a fix is correct under a clock. The interviewer cares less about elegant code than about how you reason under time pressure and how you communicate trade-offs. Below is a structured approach to each part, with code you can actually type.

## Scoping the session first

Before any code, state the contract out loud. It signals engineering maturity and prevents over-fitting to one example:

- **Parser**: what exact schema do I emit (field names, required/optional, types)? What delimiter/quoting convention? What should happen to malformed or missing rows: drop, quarantine, or repair?
- **ML project**: is there a known failing symptom, or do I find all three defects cold? Is there an expected metric range and a tiny data subset for fast iteration?
- **Constraints**: minimal-diff fixes (no refactor)? Should I add tests, or just demonstrate?

---

## Part 1: A robust text parser

### Frame the contract

- **Input**: line-oriented text, a configurable delimiter (default `\t` or `,`), optional quoting, unknown/mixed encoding, unknown rate of malformed rows.
- **Output**: a stream of structured records (`dict` / `dataclass`) **plus** a side channel of rejected lines with reasons. Never silently drop data.
- **Principle**: be liberal in what you accept, strict in what you emit, and **observable** in what you drop. Every skipped line is counted and logged. A parser that discards 30% of a dataset without telling you is a data-quality incident waiting to happen.

### Reuse vs. hand-roll

Say this explicitly; it's a senior signal. For comma/tab data with quoting and escaping, Python's `csv` module already implements RFC 4180 correctly (doubled quotes inside quoted fields, embedded delimiters/newlines).
**Reaching for `csv.reader` first is the right instinct**: re-implementing quote handling by hand is where bugs live. Hand-roll only for genuinely non-CSV formats (fixed-width, custom escapes, or multi-char delimiters `csv` can't express).

### Encoding strategy

Encoding is the most common silent corruption source:

1. **Open in binary**, sniff a BOM (`utf-8-sig` handles a UTF-8 BOM), then attempt strict `utf-8`.
2. On `UnicodeDecodeError`, fall back to a permissive decode (`latin-1` never raises), but **record that the fallback fired** so you know the file is suspect.
3. **Normalize** with `unicodedata.normalize("NFC", ...)` so visually identical strings compare equal; strip a stray mid-stream BOM. Decide deliberately whether to strip zero-width / control characters.

```python
import csv, io, logging, math, unicodedata
from dataclasses import dataclass, field

logger = logging.getLogger("parser")

def decode_bytes(raw: bytes) -> tuple[str, bool]:
    """Return (text, used_fallback)."""
    for enc in ("utf-8-sig", "utf-8"):  # utf-8-sig strips a leading file BOM
        try:
            return raw.decode(enc), False
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1", errors="replace"), True  # never raises; flag it

def clean_cell(s: str) -> str:
    s = unicodedata.normalize("NFC", s)
    # decode_bytes already strips a leading file BOM via utf-8-sig; this lstrip
    # only catches a stray BOM (U+FEFF) that survived mid-stream (after a delimiter).
    return s.strip().lstrip("\ufeff").strip()
```

### The parser core

```python
@dataclass
class ParseReport:
    total: int = 0   # data rows seen (header NOT counted here)
    ok: int = 0
    blanks: int = 0
    rejected: list[tuple[int, str, str]] = field(default_factory=list)  # (lineno, reason, raw)
    encoding_fallback: bool = False

REQUIRED = ("id", "text", "label")

def fields_to_str(fields: list[str]) -> str:
    return "\t".join(fields)[:200]  # truncate so the log doesn't explode

def parse(raw: bytes, delimiter: str = "\t") -> tuple[list[dict], ParseReport]:
    text, fallback = decode_bytes(raw)
    report = ParseReport(encoding_fallback=fallback)
    rows: list[dict] = []
    reader = csv.reader(
        io.StringIO(text, newline=""),  # let csv own newline handling
        delimiter=delimiter,
        quotechar='"',
        doublequote=True,               # "" -> " inside quoted fields
        skipinitialspace=True,
    )
    header = None
    for fields in reader:
        # Physical line where this record ends; unlike a row counter, it stays
        # accurate when a quoted field spans several lines.
        lineno = reader.line_num
        if not fields or all(c.strip() == "" for c in fields):
            report.blanks += 1
            continue                    # blank line: expected noise, skip
        if header is None:              # first non-blank row is the header
            header = [clean_cell(c) for c in fields]
            missing = [c for c in REQUIRED if c not in header]
            if missing:
                raise ValueError(f"header missing required columns: {missing}")
            continue
        report.total += 1               # count only data rows
        cells = [clean_cell(c) for c in fields]
        if len(cells) != len(header):   # wrong arity -> quarantine, don't crash
            report.rejected.append(
                (lineno, f"arity {len(cells)}!={len(header)}", fields_to_str(fields)))
            continue
        record = {k: (v if v != "" else None) for k, v in zip(header, cells)}
        missing_required = [k for k in REQUIRED if record.get(k) is None]
        if missing_required:
            report.rejected.append(
                (lineno, f"missing {missing_required}", fields_to_str(fields)))
            continue
        rows.append(record)
        report.ok += 1
    # Data-loss invariant: every counted data row is accounted for.
    assert report.ok + len(report.rejected) == report.total
    if report.rejected:
        logger.warning("parser rejected %d/%d data lines", len(report.rejected), report.total)
    return rows, report
```

Note the accounting choice: the header branch hits `continue` **before** `report.total += 1`, and blank lines go to their own bucket, so the invariant `ok + rejected == total` holds over data rows only. (A version that counted the header in `total` would need a `+1` fudge in the invariant; not counting the header avoids that.)

### Edge cases to call out

The interviewer listens for these even if you don't code them all:

- **Quoted fields containing the delimiter or a newline**: handled by `csv` because it tracks quote state across physical lines; a naive `line.split(delimiter)` breaks here. This is the single biggest reason to use `csv`.
- **Embedded `""` escape vs. backslash escape**: pick one (`doublequote=True` for RFC 4180; switch to `escapechar="\\", doublequote=False` if the data uses `\"`).
- **Ragged rows** (too few/many columns): quarantine, don't pad-and-pray.
- **Mixed line endings** (`\r\n`, `\n`, lone `\r`): `csv` with `newline=""` on the underlying stream handles all three.
- **Trailing delimiter** producing a phantom empty field; **duplicate/whitespace-padded headers**: normalize and detect collisions.
- **Empty / header-only / all-malformed file**: return cleanly with a report; never index into `[0]`.
- **Untrusted input**: cap field length (`csv.field_size_limit`) and row count to avoid a DoS (decompression-bomb-style blowup).

### How I'd test it

A layered plan:

1. **Unit fixtures, one per edge case**: quoted-delimiter row, doubled-quote row, ragged row, missing-required row, BOM-prefixed file, `latin-1`-only file, empty file, mixed line endings. Assert both the parsed records **and** the `ParseReport` (counts + reasons), because the report is part of the contract.
2. **Property / round-trip test**: generate random valid records, serialize with the matching writer, parse back, assert equality.
   Hypothesis-style fuzzing surfaces quoting bugs you'd never hand-write.
3. **Golden-file test** on an anonymized real sample with known-good output checked in, so regressions show in a diff.
4. **Adversarial fuzz**: feed random bytes; the only acceptable outcomes are "parsed" or "cleanly rejected with a reason" (the deliberate `ValueError` for an unusable header counts as a clean rejection), never an unexpected exception.
5. **Invariant assertion**: `ok + rejected == total`. If it fails, the parser is losing data.

---

## Part 2: Find and fix three defects fast

The skill is *speed in unfamiliar code*, so lead with method, then map it onto the specific bug classes the prompt hints at.

### Rapid debugging method (say this before touching code)

1. **Run it first.** Reproduce before reading. A stack trace points at the failing module in seconds and beats top-to-bottom reading.
2. **Read the trace bottom-up**, then jump to the deepest frame *in the project's own code* (skip library frames).
3. **Binary-search the pipeline.** It's a linear chain: `load → tokenize → batch → train → eval`. Drop cheap assertions/prints at boundaries (a shape, a dtype, a label distribution, a seed) and bisect to the first stage where an invariant breaks. With five stages, that finds the *stage* in 2–3 probes.
4. **Assert invariants, don't eyeball.** `assert x.shape == (B, T)`, `assert set(train_ids).isdisjoint(test_ids)`, `assert loss.requires_grad`: each turns a silent wrong-number bug into a loud, located failure.
5. **Build a minimal repro.** Shrink to a 2-row batch and 2 steps so each experiment runs in seconds. Iteration speed dominates total fix time.
6. **Distinguish crashes from silent wrongness.** Shape mismatches and off-by-ones usually *crash* (fast). Leakage, wrong reduction, and bad seeding *don't*; they produce plausible-but-wrong metrics. Budget time to actively hunt the silent class.

### The three defects: signature and fix

**Bug A: off-by-one in tokenization / indexing.** *Symptom*: `IndexError`, a shifted/all-zero feature, or vocab size off by one. *Cause*: misaligning next-token input vs. label, or an id `== vocab_size` indexing past the embedding table.

```python
# Buggy: input and target mis-shifted (same slice).
input_ids = tokens[:-1]
target_ids = tokens[:-1]  # BUG: should be tokens[1:]
```

```python
# Fix: target is input shifted by one; guard the id range.
input_ids = tokens[:-1]
target_ids = tokens[1:]
assert max(input_ids) < vocab_size, "token id out of range: off-by-one in vocab build"
```

Catch it with an alignment assertion and by `tokenizer.decode`-ing one example to confirm input and target are off by exactly one position.

**Bug B: train/test leakage.** *Symptom*: suspiciously high validation metrics that *don't* match a fresh holdout. *Causes*: (a) `fit`-ting a scaler/vocabulary/`TfidfVectorizer` on the *full* dataset before splitting; (b) splitting *before* deduplication, so duplicates and near-duplicates land in both sets; (c) splitting rows from the same user/document into both sets (group leakage).

```python
# Buggy: scaler sees test rows -> leakage.
X = scaler.fit_transform(X_all)
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

```python
# Fix: split first, fit on train only, transform test with the fitted stats.
X_train, X_test, y_train, y_test = train_test_split(X_all, y, random_state=42)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # transform only, never refit
# Use GroupShuffleSplit when rows share an entity (user/doc) to stop group leakage.
```

*Silent*: no exception. Find it by asking "did any transform see test data?" and checking `assert set(train_ids).isdisjoint(test_ids)`.

**Bug C: incorrect loss reduction.** *Symptom*: loss that won't decrease, explodes, or scales with batch size; or padding dominating the loss. *Causes*: summing instead of averaging (effective LR scales with batch size), forgetting to mask padding, or double-applying softmax (passing probabilities into `CrossEntropyLoss`, which expects raw logits).
```python
# Buggy: pad tokens counted, and sum makes loss batch-size-dependent.
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1), reduction="sum")
```

```python
# Fix: mean reduction + ignore padding -> per-token, stable.
loss = F.cross_entropy(
    logits.view(-1, V),
    targets.view(-1),
    ignore_index=PAD_ID,
    reduction="mean",
)
# Sanity (first step only): an untrained model over V classes starts near ln(V).
assert abs(loss.item() - math.log(V)) < 1.0, "initial loss far from ln(V): check reduction/logits"
```

Also silent. The $\ln(V)$ initial-loss check is a fast, model-agnostic tripwire for both wrong reduction and a double-softmax.

**Two more from the menu, in case those are the planted ones:**

- **Nondeterministic seeding.** *Symptom*: results change run-to-run; you can't confirm "the fix worked." *Fix*: seed *all* RNGs in scope (`random.seed`, `np.random.seed`, `torch.manual_seed`), set `torch.backends.cudnn.deterministic = True`, and seed the `DataLoader` via `worker_init_fn` plus a `generator`, all *before* any shuffling or weight init.
- **Shape mismatch.** *Symptom*: a loud `RuntimeError: size mismatch`, or a *silent* broadcast (`(B,1)` vs `(B,)` broadcasting to `(B,B)` in a loss). *Fix*: assert exact shapes at batch boundaries; `.squeeze(-1)` / `.unsqueeze(-1)` deliberately; never rely on broadcasting in a loss.

### Order of operations under pressure

Fix **crashers first** (they block the run and are fast to locate from the trace), then run end-to-end on the minimal repro to surface the **silent** bugs via sanity checks. Don't refactor: make the smallest correct change per bug, and keep a one-line note of each fix for the validation step.

---

## Part 3: Validate the fixes in 60 minutes

Verbalize the clock: **~5 min triage/repro, ~25 min fixing, ~25 min validating, ~5 min buffer.** Validation has four layers, cheapest to most expensive.

1. **Targeted unit tests / assertions (~5 min).** One per fix that *fails on the old code and passes on the new*. A test that doesn't fail on the buggy version proves nothing.
   - Tokenization: assert input/target alignment and `max(id) < vocab_size`.
   - Leakage: `assert set(train_ids).isdisjoint(test_ids)` and that the scaler was fit only on train.
   - Loss: assert initial loss $\approx \ln(V)$ and that padding is masked.
2. **End-to-end smoke run (~10 min).** Run the whole pipeline on a tiny subset (a few hundred rows, 2–3 epochs). Success: it completes without error, **training loss decreases**, and the model can **overfit a tiny batch (≈8 examples) to near-zero loss**, which is the single best end-to-end check that model, loss, and optimizer are wired correctly. If it can't overfit 8 examples, something is still broken.
3. **Metrics sanity checks (~5 min).** Confirm numbers are *believable*, not just "a number printed":
   - Accuracy is above the majority-class baseline but not implausibly perfect (perfect val accuracy is a leakage red flag).
   - Validation on a *fresh* holdout matches the reported metric. If val was inflated by leakage, the gap collapses after the fix. That's the *expected, good* outcome; be ready to explain it so a "dropped" metric isn't mistaken for a regression.
   - Loss starts near $\ln(V)$ and trends down.
4. **Regression guards (~5 min).** Lock the fixes in:
   - Keep the three unit tests in the suite; they *are* the regression guards.
   - **Determinism guard**: run the smoke test twice with a fixed seed and assert identical loss to a tolerance. This catches a reintroduced seeding bug.
   - **Leakage guard**: the `isdisjoint` assert lives in the data-prep path so future refactors fail loudly.
   - Note any fix you couldn't fully validate, with residual risk. Honesty about what's *not* covered is itself a strong signal.

### What good looks like vs. pitfalls

- **Good**: reproduce before reading; bisect with assertions; fix the smallest thing; validate each fix with a test that failed on the bug; explain *why* a leakage fix lowers val accuracy.
- **Pitfalls**: rewriting code you don't understand; "fixing" by re-running until numbers look nice; declaring victory on a metric inflated by the very leakage you were supposed to fix; chasing the loud crash and never hunting the silent bugs; adding zero regression tests.

---

## Addressing the follow-ups

- **Rejection rate jumps 0.5% → 12%.** Emit the `ParseReport` as a metric (rejection rate, top reasons, encoding-fallback rate) and alert on a relative jump, not an absolute count. To tell a source change from a parser regression: pin the parser version and re-run *yesterday's* file. If its rejection rate is unchanged, the spike is the new data (source schema/delimiter change); if yesterday's file now also rejects 12%, the parser regressed. The per-reason breakdown localizes it (e.g. a surge of `arity` rejections points to a delimiter change at the source).
- **Val accuracy 0.97 → 0.81 after the leakage fix.** Reframe: 0.97 was measuring memorization, not generalization. Demonstrate by scoring the *old* model on a truly fresh, never-seen holdout (or a future time-split). It collapses too, proving 0.97 never existed in the wild. 0.81 is the *honest* number that will hold in production; shipping 0.97 would have meant a silent regression the day real traffic arrived.
- **Standing tripwire suite for silent ML bugs.** Promote the debugging sanity checks into CI: an initial-loss ≈ $\ln(V)$ test, an overfit-tiny-batch test, a determinism (same-seed → same-loss) test, and a data-prep leakage assert (`train_ids` disjoint from `test_ids`, transforms fit only on train). Run them on every PR against a fixed tiny fixture so leakage/wrong-reduction/nondeterminism are caught in minutes, not in a future debugging session.
- **Files too large for memory.** Make `parse` a **generator** that `yield`s records (and updates a running `ParseReport`) instead of materializing `rows`, so memory stays $O(1)$ in the number of rows. `csv.reader` already streams over a file object, so pass the file handle (opened with `newline=""`) instead of an in-memory `StringIO`; the encoding fallback then has to be handled when opening the file rather than on one `bytes` blob. Testing changes: add a test for lazy consumption (e.g. the generator yields the first record before the whole file is read), and keep the golden-file test by consuming the generator into a list in the test only.
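The streaming restructure in the last bullet could look roughly like the sketch below. It is a simplified illustration, not the original parser: `iter_records` and `StreamReport` are hypothetical names, the `id`/`text`/`label` schema is assumed, and Unicode normalization and the encoding fallback are left out for brevity.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Iterator, TextIO

REQUIRED = ("id", "text", "label")  # assumed schema, mirroring the solution's parser

@dataclass
class StreamReport:  # hypothetical running report, updated as rows are consumed
    total: int = 0
    ok: int = 0
    rejected: list = field(default_factory=list)  # (line_num, reason)

def iter_records(handle: TextIO, report: StreamReport,
                 delimiter: str = "\t") -> Iterator[dict]:
    """Yield one record at a time; rejected rows are recorded in `report`."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = None
    for fields in reader:
        if not fields or all(c.strip() == "" for c in fields):
            continue  # blank line
        if header is None:
            header = [c.strip() for c in fields]
            continue
        report.total += 1
        if len(fields) != len(header):
            report.rejected.append((reader.line_num, "arity"))
            continue
        record = dict(zip(header, (c.strip() or None for c in fields)))
        if any(record.get(k) is None for k in REQUIRED):
            report.rejected.append((reader.line_num, "missing required"))
            continue
        report.ok += 1
        yield record

# In production, pass open(path, newline="", encoding="utf-8") instead of StringIO.
sample = io.StringIO("id\ttext\tlabel\n1\thello\tpos\n2\tonly-two\n3\tbye\tneg\n", newline="")
report = StreamReport()
stream = iter_records(sample, report)
first = next(stream)   # the first record is produced without building a full list
rest = list(stream)    # tests can still materialize the stream when convenient
```

Because the report is only complete once the generator is exhausted, the `ok + rejected == total` invariant should be checked by the caller after consumption rather than inside the generator.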

Sep 6, 2025, 12:00 AM

You are in a hands-on, hour-long ML-engineering working session (Scale AI, Machine Learning Engineer loop). You are given a small ML project — data loading, preprocessing, training, evaluation — that is supposed to train a text classifier/sequence model, but the data is dirty and the code has defects. The interviewer is watching how you work, not just the final diff: how you reason about messy input, how fast you locate bugs in unfamiliar code, and how you prove a fix is correct under a clock.

The session has three parts, in order.

Constraints & Assumptions

  • Time budget: the whole working session is ~60 minutes. Part 3 explicitly asks you to validate within that limit, so you must spend the clock deliberately.
  • Codebase: the provided ML project is "a lot of code" you have not seen before, so you cannot read it top-to-bottom in time.
  • Data: the raw text files are line-oriented but noisy. Inconsistent delimiters, quoting/escaping, mixed/unknown encodings, missing fields, and malformed lines all appear.
  • Defect count: exactly three defects are planted in the project (Part 2). They are drawn from a known menu: index off-by-one in tokenization, train/test leakage, incorrect loss reduction, nondeterministic seeding, or shape mismatches.
  • Stack: assume a standard Python + (PyTorch-style) ML stack; you may use the standard library and common libraries.

Clarifying Questions to Ask

Scope the whole session up front with the interviewer before writing code:

  • What is the exact record schema the parser should emit (field names, required vs. optional, types), and what is the delimiter/quoting convention of the input files?
  • For malformed or missing-field lines, do you want them dropped, quarantined with a reason, or repaired? Is silently losing rows acceptable?
  • For the ML project: is there a known failing symptom (a crash, a metric that looks too good, non-reproducible results), or do I discover the three defects cold?
  • Is there a ground-truth/expected metric range I should validate against, and is a tiny subset of the data available for fast iteration?
  • Do the three defects need to be fixed with minimal diffs (no refactor), and should I add tests, or just demonstrate the fixes?

Part 1 — Build a robust text parser

Given the raw, noisily-formatted text files, implement a parser that turns them into structured examples. It must handle: configurable delimiters, quoting/escaping, encodings and Unicode, missing fields, and malformed lines — and must not silently discard data. Then describe how you would test it.
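To make the quoting requirement concrete, here is a small illustration of why splitting on the delimiter is not enough. The sample text is invented for the example; only standard-library calls are used.

```python
import csv
import io

# Invented sample: one field contains the delimiter, another contains a newline.
raw = 'id,text,label\n1,"hello, world",pos\n2,"line one\nline two",neg\n'

# Naive approach: split physical lines, then split each line on the delimiter.
naive_rows = [line.split(",") for line in raw.splitlines()]

# csv.reader tracks quote state, so embedded delimiters and newlines stay in the field.
csv_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_rows), [len(r) for r in naive_rows])  # 4 [3, 4, 2, 2]
print(len(csv_rows), [len(r) for r in csv_rows])      # 3 [3, 3, 3]
print(csv_rows[2][1] == "line one\nline two")         # True
```

The naive version both miscounts rows and produces ragged field counts, which is exactly the kind of silent corruption the parser is supposed to prevent.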

What This Part Should Cover

  • Correct handling of quoting/escaping and embedded delimiters/newlines (the case a naive split breaks on), plus mixed line endings and trailing delimiters.
  • A deliberate encoding/Unicode strategy with an observable fallback, not a silent guess.
  • An explicit malformed-line and missing-field policy (quarantine-with-reason, never crash, never silently drop).
  • A layered test plan: unit fixtures per edge case (asserting the report, not just the records), round-trip/property fuzzing, a golden file, and adversarial random-bytes input.
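The round-trip layer of that test plan can be sketched with the standard library alone. This is a minimal seeded fuzz, assuming RFC 4180-style doubled-quote escaping; `random_record` and `round_trip` are hypothetical helper names, and a property-testing library could replace the hand-rolled generator.

```python
import csv
import io
import random

# Alphabet deliberately includes the delimiter, the quote char, a tab, and a newline.
ALPHABET = 'ab ,"\t\n'

def random_record(rng: random.Random, n_fields: int = 3) -> list[str]:
    """Build one record of short random strings drawn from the tricky alphabet."""
    return ["".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, 6)))
            for _ in range(n_fields)]

def round_trip(records: list[list[str]], delimiter: str = ",") -> list[list[str]]:
    """Serialize with csv.writer, then parse the text back with csv.reader."""
    buf = io.StringIO(newline="")
    csv.writer(buf, delimiter=delimiter).writerows(records)
    return list(csv.reader(io.StringIO(buf.getvalue(), newline=""), delimiter=delimiter))

rng = random.Random(0)  # fixed seed so a failure is reproducible
records = [random_record(rng) for _ in range(500)]
recovered = round_trip(records)
assert recovered == records, "round-trip changed at least one record"
```

When the real parser replaces `csv.reader` in `round_trip`, the same assertion checks it against every quoting combination the generator happens to produce.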

Part 2 — Find and fix three defects fast

In the provided ML project (load → preprocess/tokenize → batch → train → eval), identify and fix three defects, then explain your rapid-debugging approach. There is a lot of unfamiliar code, so the skill being measured is locating the problems quickly, not reading everything.

What This Part Should Cover

  • A repeatable triage method: reproduce first, read the trace bottom-up into project code, bisect with invariants, build a minimal repro for fast iteration.
  • For each defect: the observable symptom, the root cause, and a minimal correct fix (no gratuitous refactor).
  • Awareness that some planted bugs are silent and require proactive sanity checks rather than waiting for a crash.
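One proactive check for the silent leakage class is a split that keeps each entity on one side and asserts it. The sketch below is a plain-Python stand-in for a group-aware splitter, under the assumption that rows are dicts carrying a group key; `group_split` and the `user` field are hypothetical.

```python
import random

def group_split(rows, group_key, test_frac=0.2, seed=0):
    """Split rows so every group lands entirely in train or entirely in test."""
    groups = sorted({row[group_key] for row in rows})  # sorted for determinism
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    # Leakage guard: fails loudly if a future refactor mixes groups across the split.
    assert {r[group_key] for r in train}.isdisjoint({r[group_key] for r in test})
    return train, test

# Invented data: 20 rows spread over 5 users.
rows = [{"id": i, "user": f"u{i % 5}", "text": f"example {i}"} for i in range(20)]
train, test = group_split(rows, "user", test_frac=0.2, seed=0)
```

Any fitted transform (scaler, vocabulary, vectorizer) would then be fit on `train` only and applied to `test`.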

Part 3 — Validate the fixes in 60 minutes

Describe how you would confirm the three fixes are correct within the time limit, using unit tests, an end-to-end run, metrics sanity checks, and regression guards.

What This Part Should Cover

  • A concrete time allocation across triage, fixing, and validation that fits 60 minutes.
  • One targeted test per fix, each demonstrably failing on the buggy version.
  • An end-to-end smoke run with explicit success criteria (completes, loss decreases, overfits a tiny batch) and metric sanity checks against a baseline.
  • Regression guards (determinism check, leakage assert, kept unit tests) plus honesty about anything left unvalidated.
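The initial-loss sanity check can be worked through without any ML framework. The sketch below is a plain-Python reference, assuming padding id 0 and a flat list of per-token logit vectors; `token_cross_entropy` is a hypothetical helper, not a library function. With uniform logits over $V$ classes, each token's loss is exactly $\ln(V)$, so mean reduction gives $\ln(V)$ while sum reduction scales with the number of non-pad tokens.

```python
import math

PAD_ID = 0  # assumed padding id

def token_cross_entropy(logits, targets, ignore_index=PAD_ID, reduction="mean"):
    """Cross-entropy over per-token logit vectors, skipping padded positions."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding must not contribute to the loss
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[target])
    total = sum(losses)
    if reduction == "sum":
        return total
    return total / max(len(losses), 1)

V = 50
uniform_logits = [[0.0] * V for _ in range(8)]   # stands in for an untrained model
targets = [3, 7, 1, PAD_ID, 9, PAD_ID, 2, 4]     # two padded positions

mean_loss = token_cross_entropy(uniform_logits, targets)
sum_loss = token_cross_entropy(uniform_logits, targets, reduction="sum")

assert abs(mean_loss - math.log(V)) < 1e-9       # per-token loss is ln(V)
assert abs(sum_loss - 6 * math.log(V)) < 1e-9    # sum grows with token count
```

A test of this shape fails on a sum-reduced or unmasked loss and passes on the fixed one, which is what makes it usable as a regression guard.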

What a Strong Answer Covers

Across all three parts, the interviewer is reading for engineering judgment under time pressure, not perfect code:

  • Defensive, observable engineering: input/output contracts stated before coding, no silent data loss, invariants asserted rather than eyeballed.
  • Speed of iteration: minimal repros, bisection, and tiny subsets so each experiment runs in seconds. The candidate optimizes total fix time, not line-by-line reading.
  • Distinguishing crashes from silent wrongness and proving correctness with tests that fail on the bug, including explaining why a leakage fix legitimately lowers validation accuracy.
  • Communication: verbalizing trade-offs, time allocation, and residual risk, treating the session as collaboration, not a silent coding exam.
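"Tests that fail on the bug" can be shown in a few lines for the tokenization off-by-one. This is an illustrative sketch: `make_pair_buggy`, `make_pair_fixed`, and `is_aligned` are hypothetical names, and the token ids are invented.

```python
def make_pair_buggy(tokens):
    """Buggy next-token pair: the target is not shifted."""
    return tokens[:-1], tokens[:-1]

def make_pair_fixed(tokens):
    """Fixed next-token pair: the target is the input shifted by one."""
    return tokens[:-1], tokens[1:]

def is_aligned(make_pair, tokens):
    """True when every target is the token that follows its input position."""
    inputs, targets = make_pair(tokens)
    if len(inputs) != len(targets):
        return False
    return all(targets[i] == tokens[i + 1] for i in range(len(targets)))

# Adjacent tokens must differ, otherwise the unshifted target could pass by accident.
tokens = [5, 9, 2, 7, 3]
buggy_ok = is_aligned(make_pair_buggy, tokens)
fixed_ok = is_aligned(make_pair_fixed, tokens)
print(buggy_ok, fixed_ok)  # False True
```

Running the same check against both versions is the evidence that the test actually detects the defect rather than passing vacuously.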

Follow-up Questions

  • Your parser quarantines malformed lines. The rejection rate jumps from 0.5% to 12% overnight in production. How do you alert on this, and how do you decide whether it's a data-source change or a parser regression?
  • You fixed train/test leakage and validation accuracy dropped from 0.97 to 0.81. How do you convince a skeptical stakeholder this is an improvement, not a regression?
  • The three bugs included a silent one you found only via the $\ln(V)$ loss check. How would you build a standing "tripwire" suite so silent ML bugs (leakage, wrong reduction, nondeterminism) are caught in CI rather than in a debugging session?
  • For very large input files that don't fit in memory, how would you restructure the parser, and what changes in your testing strategy?
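For the first follow-up, a relative-jump alert with a per-reason breakdown might be sketched as follows. The function name, the 3x threshold, and the sample reason strings are assumptions chosen for illustration, not values from the question.

```python
from collections import Counter

def rejection_summary(reasons, total_rows, baseline_rate, rel_threshold=3.0):
    """Summarize parser rejections and flag a relative jump over the baseline rate.

    rel_threshold is an assumed multiplier: alert when the current rate exceeds
    baseline_rate * rel_threshold.
    """
    rate = len(reasons) / total_rows if total_rows else 0.0
    return {
        "rate": rate,
        "alert": rate > baseline_rate * rel_threshold,
        "top_reasons": Counter(reasons).most_common(3),
    }

# Invented numbers matching the scenario: baseline 0.5%, today 120 of 1000 rejected.
reasons = ["arity 4!=3"] * 100 + ["missing ['label']"] * 20
summary = rejection_summary(reasons, total_rows=1000, baseline_rate=0.005)
print(summary["alert"], summary["top_reasons"][0])  # True ('arity 4!=3', 100)
```

A dominant arity reason like the one above would point toward a delimiter or schema change at the source; re-running the previous day's file through the same parser version is what separates that from a parser regression.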