- Given raw text files with noisy formatting, implement a robust parser that outputs structured examples; handle delimiters, quoting/escaping, encodings/Unicode, missing fields, and malformed lines, and describe how you would test it.
- In a provided ML project (data loading, preprocessing, training, evaluation), identify and fix three defects (e.g., index off-by-one in tokenization, train/test leakage, incorrect loss reduction, nondeterministic seeding, or shape mismatches). Explain your rapid debugging approach (stack traces, assertions, binary search logging, minimal repros).
- Describe how you would validate the fixes under a 60-minute time limit (unit tests, end-to-end run, metrics sanity checks, and regression guards).
Quick Answer: This question evaluates practical machine learning engineering skills, including debugging unfamiliar codebases, parsing noisy text data, and identifying subtle defects such as train/test leakage or shape mismatches. It tests applied Python proficiency and systematic reasoning under time pressure, core competencies assessed in ML engineer technical interviews.
Solution
This is a hands-on ML-engineering session, not a theory quiz. Three things are measured: whether you can write a *defensive* parser that survives real-world dirty data, whether you can *triage and fix bugs fast* in unfamiliar code, and whether you can *prove* a fix is correct under a clock. The interviewer cares less about elegant code than about how you reason under time pressure and how you communicate trade-offs. Below is a structured approach to each part with code you can actually type.
## Scoping the session first
Before any code, state the contract out loud — it signals engineering maturity and prevents over-fitting to one example:
- **Parser**: what exact schema do I emit (field names, required/optional, types)? What delimiter/quoting convention? What do you want done with malformed/missing rows — drop, quarantine, or repair?
- **ML project**: is there a known failing symptom, or do I find all three defects cold? Is there an expected metric range and a tiny data subset for fast iteration?
- **Constraints**: minimal-diff fixes (no refactor)? Should I add tests, or just demonstrate?
---
## Part 1 — A robust text parser
### Frame the contract
- **Input**: line-oriented text, a configurable delimiter (default `\t` or `,`), optional quoting, unknown/mixed encoding, unknown rate of malformed rows.
- **Output**: a stream of structured records (`dict` / `dataclass`) **plus** a side channel of rejected lines with reasons. Never silently drop data.
- **Principle**: be liberal in what you accept, strict in what you emit, **observable** in what you drop — every skipped line is counted and logged. A parser that discards 30% of a dataset without telling you is a data-quality incident waiting to happen.
### Reuse vs. hand-roll
Say this explicitly; it's a senior signal. For comma/tab data with quoting and escaping, Python's `csv` module already handles RFC 4180-style conventions (doubled quotes inside quoted fields, embedded delimiters and newlines). **Reaching for `csv.reader` first is the right instinct**: re-implementing quote handling by hand is where bugs live. Hand-roll only for genuinely non-CSV formats (fixed-width, custom escapes, or multi-character delimiters, which `csv` can't express).
### Encoding strategy
Encoding is the most common silent corruption source:
1. **Open in binary**, sniff a BOM (`utf-8-sig` handles UTF-8 BOM), then attempt strict `utf-8`.
2. On `UnicodeDecodeError`, fall back to a permissive decode (`latin-1` never raises) — but **record that the fallback fired** so you know the file is suspect.
3. **Normalize** with `unicodedata.normalize("NFC", ...)` so visually identical strings compare equal; strip a stray mid-stream BOM. Decide deliberately whether to strip zero-width / control characters.
```python
import csv, io, logging, math, unicodedata
from dataclasses import dataclass, field

logger = logging.getLogger("parser")


def decode_bytes(raw: bytes) -> tuple[str, bool]:
    """Return (text, used_fallback)."""
    for enc in ("utf-8-sig", "utf-8"):  # utf-8-sig strips a leading file BOM
        try:
            return raw.decode(enc), False
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1", errors="replace"), True  # never raises; flag it


def clean_cell(s: str) -> str:
    s = unicodedata.normalize("NFC", s)
    # decode_bytes already strips a leading file BOM via utf-8-sig; this lstrip
    # only catches a stray BOM that survived mid-stream (after a delimiter).
    # str.strip() does not treat U+FEFF as whitespace, so name it explicitly.
    return s.strip().lstrip("\ufeff")
```
### The parser core
```python
@dataclass
class ParseReport:
    total: int = 0  # data rows seen (header NOT counted here)
    ok: int = 0
    blanks: int = 0
    rejected: list[tuple[int, str, str]] = field(default_factory=list)  # (lineno, reason, raw)
    encoding_fallback: bool = False


REQUIRED = ("id", "text", "label")


def fields_to_str(fields: list[str]) -> str:
    return "\t".join(fields)[:200]  # truncate so the log doesn't explode


def parse(raw: bytes, delimiter: str = "\t") -> tuple[list[dict], ParseReport]:
    text, fallback = decode_bytes(raw)
    report = ParseReport(encoding_fallback=fallback)
    rows: list[dict] = []
    reader = csv.reader(
        io.StringIO(text, newline=""),  # let csv own newline handling
        delimiter=delimiter,
        quotechar='"',
        doublequote=True,  # "" -> " inside quoted fields
        skipinitialspace=True,
    )
    header = None
    for lineno, fields in enumerate(reader, start=1):
        if not fields or all(c.strip() == "" for c in fields):
            report.blanks += 1
            continue  # blank line: expected noise, skip
        if header is None:  # first non-blank row is the header
            header = [clean_cell(c) for c in fields]
            missing = [c for c in REQUIRED if c not in header]
            if missing:
                raise ValueError(f"header missing required columns: {missing}")
            continue
        report.total += 1  # count only data rows
        cells = [clean_cell(c) for c in fields]
        if len(cells) != len(header):  # wrong arity -> quarantine, don't crash
            report.rejected.append((lineno, f"arity {len(cells)}!={len(header)}", fields_to_str(fields)))
            continue
        record = {k: (v if v != "" else None) for k, v in zip(header, cells)}
        missing_required = [k for k in REQUIRED if record.get(k) is None]
        if missing_required:
            report.rejected.append((lineno, f"missing {missing_required}", fields_to_str(fields)))
            continue
        rows.append(record)
        report.ok += 1
    # Data-loss invariant: every counted data row is accounted for.
    assert report.ok + len(report.rejected) == report.total
    if report.rejected:
        logger.warning("parser rejected %d/%d data lines", len(report.rejected), report.total)
    return rows, report
```
Note the accounting choice: the header row hits `continue` **before** `report.total += 1`, and blank lines go to their own bucket, so the invariant `ok + rejected == total` holds over data rows only. (Counting the header in `total` would force a `+1` fudge into the invariant; leaving the header out keeps the check clean.)
### Edge cases to call out
The interviewer listens for these even if you don't code them all:
- **Quoted fields containing the delimiter or a newline** — handled by `csv` because it tracks quote state across physical lines; a naive `line.split(delimiter)` breaks here. This is the single biggest reason to use `csv`.
- **Embedded `""` escape vs. backslash escape** — pick one (`doublequote=True` for RFC 4180; switch to `escapechar="\\", doublequote=False` if the data uses `\"`).
- **Ragged rows** (too few/many columns) — quarantine, don't pad-and-pray.
- **Mixed line endings** (`\r\n`, `\n`, lone `\r`) — `csv` with `newline=""` normalizes these.
- **Trailing delimiter** producing a phantom empty field; **duplicate/whitespace-padded headers** — normalize and detect collisions.
- **Empty / header-only / all-malformed file** — return cleanly with a report; never index into `[0]`.
- **Untrusted input** — cap field length / row count to avoid a DoS (decompression-bomb-style blowup).
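To make the first bullet concrete, here is a small sketch that runs a naive split and `csv.reader` over the same made-up sample. The `sample` string and variable names are illustrative only.

```python
import csv
import io

# Illustrative sample: one quoted comma, one quoted newline, one doubled quote.
sample = (
    'id,text,label\n'
    '1,"hello, world",pos\n'
    '2,"line one\nline two",neg\n'
    '3,"she said ""hi""",pos\n'
)

# Naive approach: split physical lines, then split each on the delimiter.
naive_rows = [line.split(",") for line in sample.splitlines()]

# csv tracks quote state, so embedded delimiters and newlines stay inside the field.
csv_rows = list(csv.reader(io.StringIO(sample, newline="")))

# naive_rows has 5 "rows" (the quoted newline split one record in two, and the
# quoted comma added a phantom field); csv_rows has the intended 4 records.
```

The naive version fails without raising, which is why it is worth showing: the row count and arity are simply wrong.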
### How I'd test it
A layered plan:
1. **Unit fixtures, one per edge case**: quoted-delimiter row, doubled-quote row, ragged row, missing-required row, BOM-prefixed file, `latin-1`-only file, empty file, mixed line endings. Assert both the parsed records **and** the `ParseReport` (counts + reasons) — the report is part of the contract.
2. **Property / round-trip test**: generate random valid records, serialize with the matching writer, parse back, assert equality. Hypothesis-style fuzzing surfaces quoting bugs you'd never hand-write.
3. **Golden-file test** on an anonymized real sample with known-good output checked in, so regressions show in a diff.
4. **Adversarial fuzz**: feed random bytes; only acceptable outcomes are "parsed" or "cleanly rejected with a reason" — never an uncaught exception.
5. **Invariant assertion**: `ok + rejected == total` — if it fails, the parser is losing data.
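The round-trip idea in item 2 can be sketched with the standard library alone. This version uses a seeded `random.Random` in place of a property-testing library, and it round-trips through `csv.writer` and `csv.reader` directly. In the real suite you would swap the reader for `parse` and account for its whitespace stripping. The helper names and `ALPHABET` are assumptions made for illustration.

```python
import csv
import io
import random

# Characters chosen to stress quoting: delimiter, quote, newline, tab, non-ASCII.
ALPHABET = 'ab ,\t"\né'


def random_record(rng, n_fields=3, max_len=6):
    """Build one record of random short strings drawn from ALPHABET."""
    return [
        "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, max_len)))
        for _ in range(n_fields)
    ]


def round_trip(records):
    """Serialize with csv.writer, then parse the text back with csv.reader."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(records)
    return list(csv.reader(io.StringIO(buf.getvalue(), newline="")))


def check_round_trip(trials=200, seed=0):
    """Assert write -> read is the identity on random records; return trial count."""
    rng = random.Random(seed)
    for _ in range(trials):
        records = [random_record(rng) for _ in range(rng.randint(1, 5))]
        assert round_trip(records) == records, records
    return trials
```

A fixed seed keeps any failure reproducible, and the failing `records` value in the assertion message is a ready-made unit fixture.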
---
## Part 2 — Find and fix three defects fast
The skill is *speed in unfamiliar code*, so lead with method, then map it onto the specific bug classes the prompt hints at.
### Rapid debugging method (say this before touching code)
1. **Run it first.** Reproduce before reading — a stack trace points at the failing module in seconds and beats top-to-bottom reading.
2. **Read the trace bottom-up**, jump to the deepest frame *in the project's own code* (skip library frames).
3. **Binary-search the pipeline.** It's a linear chain: `load → tokenize → batch → train → eval`. Drop cheap assertions/prints at boundaries (a shape, a dtype, a label distribution, a seed) and bisect to the first stage where an invariant breaks — finds the *stage* in 2–3 probes.
4. **Assert invariants, don't eyeball.** `assert x.shape == (B, T)`, `assert set(train_ids).isdisjoint(test_ids)`, `assert loss.requires_grad` — turns a silent wrong-number bug into a loud, located failure.
5. **Build a minimal repro.** Shrink to a 2-row batch and 2 steps so each experiment runs in seconds. Iteration speed dominates total fix time.
6. **Distinguish crashes from silent wrongness.** Shape mismatches and off-by-ones usually *crash* (fast). Leakage, wrong reduction, and bad seeding *don't* — they produce plausible-but-wrong metrics. Budget time to actively hunt the silent class.
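Steps 3 and 4 can be sketched as a tiny harness that attaches an invariant to each stage boundary and reports the first stage whose output breaks it. Everything here is hypothetical: the toy stages stand in for the real pipeline, and the helper checks boundaries in order rather than bisecting. On a longer chain you would probe the midpoint first.

```python
from typing import Any, Callable, Optional

# (stage name, stage function, invariant that must hold on the stage's output)
Stage = tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


def first_broken_stage(data: Any, stages: list[Stage]) -> Optional[str]:
    """Run stages in order; return the first stage whose output violates its invariant."""
    for name, fn, invariant in stages:
        data = fn(data)
        if not invariant(data):
            return name
    return None


# Toy pipeline with a planted padding bug in the batching stage.
def load(_):
    return ["a b c", "d e"]


def tokenize(texts):
    return [t.split() for t in texts]


def batch(seqs):
    width = max(len(s) for s in seqs)
    return [s + ["<pad>"] * (width - len(s) - 1) for s in seqs]  # BUG: pads one short


stages = [
    ("load", load, lambda d: len(d) > 0),
    ("tokenize", tokenize, lambda d: all(len(s) > 0 for s in d)),
    ("batch", batch, lambda d: len({len(s) for s in d}) == 1),  # rectangular batch
]

broken = first_broken_stage(None, stages)  # "batch"
```

Each invariant is one cheap boolean check, so adding a probe costs seconds and localizes the failure to a single stage.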
### The three defects — signature and fix
**Bug A — off-by-one in tokenization / indexing.** *Symptom*: `IndexError`, a shifted/all-zero feature, or vocab size off by one. *Cause*: misaligning next-token input vs. label, or an id `== vocab_size` indexing past the embedding table.
```python
# Buggy: input and target mis-shifted (same slice).
input_ids = tokens[:-1]
target_ids = tokens[:-1] # BUG: should be tokens[1:]
```
```python
# Fix: target is input shifted by one; guard the id range.
input_ids = tokens[:-1]
target_ids = tokens[1:]
assert max(input_ids) < vocab_size, "token id out of range — off-by-one in vocab build"
```
Catch it with an alignment assertion and by `tokenizer.decode`-ing one example to see input/target are off by exactly one position.
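One possible form of that alignment assertion, assuming plain Python lists of token ids (for tensors you would compare with the framework's equality function instead). The helper name `is_shifted_by_one` is made up for this sketch.

```python
def is_shifted_by_one(input_ids, target_ids):
    """True when target_ids is input_ids advanced by exactly one position."""
    return (
        len(input_ids) == len(target_ids)
        and input_ids[1:] == target_ids[:-1]
    )


tokens = [11, 42, 7, 99, 3]

good_pair = (tokens[:-1], tokens[1:])    # correct next-token setup
buggy_pair = (tokens[:-1], tokens[:-1])  # the planted bug: same slice twice

good_ok = is_shifted_by_one(*good_pair)    # True
buggy_ok = is_shifted_by_one(*buggy_pair)  # False
```

The check cannot distinguish the two cases on a sequence of identical tokens, so run it on a real example with varied ids.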
**Bug B — train/test leakage.** *Symptom*: validation metrics that look suspiciously high and *don't* hold up on a fresh holdout. *Causes*: (a) `fit`-ting a scaler/vocabulary/`TfidfVectorizer` on the *full* dataset before splitting; (b) splitting *before* deduplication (or never deduplicating), so exact or near-duplicates land in both sets; (c) splitting rows from the same user/document into both sets (group leakage).
```python
# Buggy: scaler sees test rows -> leakage.
X = scaler.fit_transform(X_all)
X_train, X_test, y_train, y_test = train_test_split(X, y)
```
```python
# Fix: split first, fit on train only, transform test with the fitted stats.
X_train, X_test, y_train, y_test = train_test_split(X_all, y, random_state=42)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) # transform only — never refit
# Use GroupShuffleSplit when rows share an entity (user/doc) to stop group leakage.
```
*Silent* — no exception. Find it by asking "did any transform see test data?" and checking `assert set(train_ids).isdisjoint(test_ids)`.
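For cause (c), the library route is scikit-learn's `GroupShuffleSplit`, as the comment in the fix notes. The sketch below is a standard-library stand-in that shows the idea: split on *groups*, not rows, then assert disjointness. The function name `group_split` and the sample `user_ids` are assumptions for illustration.

```python
import random


def group_split(group_ids, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) such that no group appears on both sides."""
    groups = sorted(set(group_ids))  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx


# Hypothetical data: several rows per user.
user_ids = ["u1", "u1", "u2", "u3", "u3", "u4", "u4", "u4"]
train_idx, test_idx = group_split(user_ids)

train_users = {user_ids[i] for i in train_idx}
test_users = {user_ids[i] for i in test_idx}
assert train_users.isdisjoint(test_users), "group leakage: a user is in both splits"
```

Keeping the final assertion in the data-prep path turns this silent bug into a loud one if a later refactor reverts to a row-level split.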
**Bug C — incorrect loss reduction.** *Symptom*: loss that won't decrease, explodes, or scales with batch size; or padding dominating the loss. *Causes*: summing instead of averaging (effective LR scales with batch size), forgetting to mask padding, or double-applying softmax (passing probabilities into `CrossEntropyLoss`, which expects raw logits).
```python
# Buggy: pad tokens counted, and sum makes loss batch-size-dependent.
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1), reduction="sum")
```
```python
# Fix: mean reduction + ignore padding -> per-token, stable.
loss = F.cross_entropy(
    logits.view(-1, V), targets.view(-1),
    ignore_index=PAD_ID, reduction="mean",
)
# Sanity: an untrained model over V classes starts near ln(V).
assert abs(loss.item() - math.log(V)) < 1.0, "initial loss far from ln(V) — check reduction/logits"
```
Also silent. The $\ln(V)$ initial-loss check is a fast, model-agnostic tripwire for both wrong reduction and a double-softmax.
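The reasoning behind the tripwire: an untrained model is roughly uniform over classes, so each token has probability $1/V$ and per-token loss $-\ln(1/V) = \ln(V)$. A pure-Python sketch of masked cross-entropy makes both the $\ln(V)$ baseline and the sum-versus-mean scaling visible. This is an illustrative reimplementation, not the framework's code. Returning `0.0` when every target is padding is a choice made for this sketch.

```python
import math


def cross_entropy(logits, targets, pad_id, reduction="mean"):
    """Masked cross-entropy over rows of raw logits (illustrative, pure Python)."""
    per_token = []
    for row, target in zip(logits, targets):
        if target == pad_id:
            continue  # padding contributes nothing to the loss
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        per_token.append(log_z - row[target])
    total = sum(per_token)
    if reduction == "sum":
        return total
    return total / max(len(per_token), 1)  # mean over non-pad tokens only


V, PAD_ID = 10, 0
uniform_logits = [[0.0] * V for _ in range(4)]  # stand-in for an untrained model
targets = [3, 7, 1, PAD_ID]                     # last position is padding

mean_loss = cross_entropy(uniform_logits, targets, PAD_ID)                  # ln(10)
sum_loss = cross_entropy(uniform_logits, targets, PAD_ID, reduction="sum")  # 3 * ln(10)
```

With `reduction="sum"` the value grows with the number of non-pad tokens, so an initial loss far above $\ln(V)$ suggests a summed or unmasked loss.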
**Two more from the menu, in case those are the planted ones:**
- **Nondeterministic seeding.** *Symptom*: results change run-to-run; you can't confirm "the fix worked." *Fix*: seed *all* RNGs in scope — `random.seed`, `np.random.seed`, `torch.manual_seed`, set `torch.backends.cudnn.deterministic = True`, and seed the `DataLoader` `worker_init_fn` + a `generator` — *before* any shuffling or weight init.
- **Shape mismatch.** *Symptom*: a loud `RuntimeError: size mismatch`, or a *silent* broadcast (`(B,1)` vs `(B,)` broadcasting to `(B,B)` in a loss). *Fix*: assert exact shapes at batch boundaries; `.squeeze(-1)` / `.unsqueeze(-1)` deliberately; never rely on broadcasting in a loss.
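A minimal seeding helper along the lines of the first bullet might look like the sketch below. The name `seed_everything` is an assumption, and NumPy and PyTorch are seeded only if they are installed, so the snippet also runs in a bare environment.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed every RNG that is available; call before shuffling or weight init."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False  # autotuning can pick different kernels
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(123)
first_draw = [random.random() for _ in range(3)]
seed_everything(123)
second_draw = [random.random() for _ in range(3)]  # same values as first_draw
```

Data-loader workers keep their own RNG state, so the `worker_init_fn` and `generator` arguments mentioned above still need to be set where the loader is constructed.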
### Order of operations under pressure
Fix **crashers first** (they block the run and are fast to locate from the trace), then run end-to-end on the minimal repro to surface the **silent** bugs via sanity checks. Don't refactor — smallest correct change per bug, with a one-line note of each fix for the validation step.
---
## Part 3 — Validate the fixes in 60 minutes
Verbalize the clock: **~5 min triage/repro, ~25 min fixing, ~25 min validating, ~5 min buffer.** Validation has four layers, cheapest-to-most-expensive.
1. **Targeted unit tests / assertions (~5 min).** One per fix that *fails on the old code and passes on the new* — a test that doesn't fail on the buggy version proves nothing.
   - Tokenization: assert input/target alignment and `max(id) < vocab_size`.
   - Leakage: `assert set(train_ids).isdisjoint(test_ids)` and that the scaler was fit only on train.
   - Loss: assert initial loss $\approx \ln(V)$ and that padding is masked.
2. **End-to-end smoke run (~10 min).** Run the whole pipeline on a tiny subset (a few hundred rows, 2–3 epochs). Success: completes without error, **training loss decreases**, and the model can **overfit a tiny batch (≈8 examples) to near-zero loss** — the single best end-to-end check that model+loss+optimizer are wired correctly. If it can't overfit 8 examples, something is still broken.
3. **Metrics sanity checks (~5 min).** Confirm numbers are *believable*, not just "a number printed":
   - Accuracy is above the majority-class baseline but not implausibly perfect (perfect val accuracy is a leakage red flag).
   - Validation on a *fresh* holdout matches the reported metric. If val was inflated by leakage, the gap collapses after the fix — that's the *expected, good* outcome; be ready to explain so a "dropped" metric isn't mistaken for a regression.
   - Loss starts near $\ln(V)$ and trends down.
4. **Regression guards (~5 min).** Lock the fixes in:
   - Keep the three unit tests in the suite — they *are* the regression guards.
   - **Determinism guard**: run the smoke test twice with a fixed seed, assert identical loss to a tolerance — catches a reintroduced seeding bug.
   - **Leakage guard**: the `isdisjoint` assert lives in the data-prep path so future refactors fail loudly.
   - Note any fix you couldn't fully validate, with residual risk — honesty about what's *not* covered is itself a strong signal.
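The overfit-a-tiny-batch check and the determinism guard can share one fixture. The sketch below uses a toy one-feature linear model trained by hand-written gradient descent as a stand-in for the real training loop. The function name, data, and hyperparameters are all assumptions chosen so it runs in milliseconds.

```python
import random


def train_tiny(seed, steps=500, lr=0.1):
    """Fit y = w*x + b on 8 noiseless points; return the per-step MSE history."""
    rng = random.Random(seed)
    xs = [i / 4 - 1 for i in range(8)]    # 8 fixed inputs in [-1, 0.75]
    ys = [2.0 * x + 0.5 for x in xs]      # noiseless target, so loss can reach ~0
    w, b = rng.gauss(0.0, 1.0), 0.0       # seeded random init
    losses = []
    for _ in range(steps):
        errs = [(w * x + b) - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errs) / len(xs))
        grad_w = 2 * sum(e * x for e, x in zip(errs, xs)) / len(xs)
        grad_b = 2 * sum(errs) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
    return losses


run_a = train_tiny(seed=0)
run_b = train_tiny(seed=0)

assert run_a[-1] < run_a[0], "training loss did not decrease"
assert run_a[-1] < 1e-6, "could not overfit 8 examples; something is still broken"
assert run_a == run_b, "same seed gave different losses; seeding is not deterministic"
```

Exact equality works here because the toy runs in pure Python on the CPU. With GPU kernels, compare to a small tolerance instead.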
### What good looks like vs. pitfalls
- **Good**: reproduce before reading; bisect with assertions; fix the smallest thing; validate each fix with a test that failed on the bug; explain *why* a leakage fix lowers val accuracy.
- **Pitfalls**: rewriting code you don't understand; "fixing" by re-running until numbers look nice; declaring victory on a metric inflated by the very leakage you were supposed to fix; chasing the loud crash and never hunting the silent bugs; adding zero regression tests.
---
## Addressing the follow-ups
- **Rejection rate jumps 0.5% → 12%.** Emit the `ParseReport` as a metric (rejection rate, top reasons, encoding-fallback rate) and alert on a relative jump, not an absolute count. To tell a source change from a parser regression: pin the parser version and re-run *yesterday's* file — if its rejection rate is unchanged, the spike is the new data (source schema/delimiter change); if yesterday's file now also rejects 12%, the parser regressed. The per-reason breakdown localizes it (e.g. a surge of `arity` rejections points to a delimiter change at the source).
- **Val accuracy 0.97 → 0.81 after the leakage fix.** Reframe: 0.97 was measuring memorization, not generalization. Demonstrate by scoring the *old* model on a truly fresh, never-seen holdout (or a future time-split) — it collapses too, proving 0.97 never existed in the wild. 0.81 is the *honest* number that will hold in production; shipping 0.97 would have meant a silent regression the day real traffic arrived.
- **Standing tripwire suite for silent ML bugs.** Promote the debugging sanity checks into CI: an initial-loss ≈ $\ln(V)$ test, an overfit-tiny-batch test, a determinism (same-seed → same-loss) test, and a data-prep leakage assert (`train_ids` disjoint from `test_ids`, transforms fit only on train). Run them on every PR against a fixed tiny fixture so leakage/wrong-reduction/nondeterminism are caught in minutes, not in a future debugging session.
- **Files too large for memory.** Make `parse` a **generator** that `yield`s records (and updates a running `ParseReport`) instead of materializing `rows` — memory drops to $O(1)$ in rows. `csv.reader` already streams over a file object, so pass the file handle instead of an in-memory `StringIO`. Testing changes: add a test asserting constant memory / lazy consumption (e.g. the generator yields the first record before the whole file is read), and keep the golden-file test by consuming the generator into a list in the test only.
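A streaming variant for the large-file case could look like the sketch below. It is a simplified stand-in for `parse` (no encoding fallback, no required-column check), and the names `iter_records`, `demo_lines`, and `demo_stats` are hypothetical. Running counts go into a `Counter` the caller owns, since a generator cannot return a report until it is exhausted.

```python
import csv
from collections import Counter
from typing import Iterable, Iterator, Optional


def iter_records(
    lines: Iterable[str],
    delimiter: str = "\t",
    stats: Optional[Counter] = None,
) -> Iterator[dict]:
    """Lazily yield one dict per well-formed data row; tally the rest in stats."""
    stats = stats if stats is not None else Counter()
    header = None
    for fields in csv.reader(lines, delimiter=delimiter):
        if not fields or all(c.strip() == "" for c in fields):
            stats["blanks"] += 1
            continue
        if header is None:  # first non-blank row is the header
            header = [c.strip() for c in fields]
            continue
        stats["total"] += 1
        if len(fields) != len(header):  # ragged row: count it and move on
            stats["rejected"] += 1
            continue
        stats["ok"] += 1
        yield dict(zip(header, (c.strip() for c in fields)))


# Illustrative input; in practice pass a file opened with newline="".
demo_lines = ["id\ttext\tlabel\n", "1\thello\tpos\n", "2\tbroken\n", "3\tbye\tneg\n"]
demo_stats = Counter()
demo_records = list(iter_records(iter(demo_lines), stats=demo_stats))
```

Because `csv.reader` pulls one line at a time from any iterable of strings, only the current row is held in memory, and the `ok + rejected == total` invariant can still be asserted once the generator is drained.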