- Given raw text files with noisy formatting, implement a robust parser that outputs structured examples; handle delimiters, quoting/escaping, encodings/Unicode, missing fields, and malformed lines, and describe how you would test it.
- In a provided ML project (data loading, preprocessing, training, evaluation), identify and fix three defects (e.g., index off-by-one in tokenization, train/test leakage, incorrect loss reduction, nondeterministic seeding, or shape mismatches). Explain your rapid debugging approach (stack traces, assertions, binary search logging, minimal repros).
- Describe how you would validate the fixes under a 60-minute time limit (unit tests, end-to-end run, metrics sanity checks, and regression guards).

This is a hands-on ML-engineering session, not a theory quiz. Three things are measured: whether you can write a *defensive* parser that survives real-world dirty data, whether you can *triage and fix bugs fast* in unfamiliar code, and whether you can *prove* a fix is correct under a clock. The interviewer cares less about elegant code than about how you reason under time pressure and how you communicate trade-offs. Below is a structured approach to each part with code you can actually type.

## Scoping the session first

Before any code, state the contract out loud. It signals engineering maturity and prevents over-fitting to one example:

- **Parser**: what exact schema do I emit (field names, required/optional, types)? What delimiter/quoting convention? What do you want done with malformed/missing rows: drop, quarantine, or repair?
- **ML project**: is there a known failing symptom, or do I find all three defects cold? Is there an expected metric range and a tiny data subset for fast iteration?
- **Constraints**: minimal-diff fixes (no refactor)? Should I add tests, or just demonstrate?

---

## Part 1: A robust text parser

### Frame the contract

- **Input**: line-oriented text, a configurable delimiter (default `\t` or `,`), optional quoting, unknown/mixed encoding, unknown rate of malformed rows.
- **Output**: a stream of structured records (`dict` / `dataclass`) **plus** a side channel of rejected lines with reasons. Never silently drop data.
- **Principle**: be liberal in what you accept, strict in what you emit, **observable** in what you drop. Every skipped line is counted and logged. A parser that discards 30% of a dataset without telling you is a data-quality incident waiting to happen.

### Reuse vs. hand-roll

Say this explicitly; it's a senior signal. For comma/tab data with quoting and escaping, Python's `csv` module already handles RFC 4180-style quoting correctly (doubled quotes inside quoted fields, embedded delimiters/newlines). **Reaching for `csv.reader` first is the right instinct**: re-implementing quote handling by hand is where bugs live. Hand-roll only for genuinely non-CSV formats (fixed-width, custom escapes, or multi-char delimiters `csv` can't express).

### Encoding strategy

Encoding is the most common source of silent corruption:

1. **Open in binary** and decode strictly as `utf-8-sig`, which strips a leading UTF-8 BOM and otherwise behaves like `utf-8`.
2. On `UnicodeDecodeError`, fall back to a permissive decode (`latin-1` never raises), but **record that the fallback fired** so you know the file is suspect.
3. **Normalize** with `unicodedata.normalize("NFC", ...)` so visually identical strings compare equal; strip a stray mid-stream BOM. Decide deliberately whether to strip zero-width / control characters.

```python
import csv, io, logging, unicodedata
from dataclasses import dataclass, field

logger = logging.getLogger("parser")

def decode_bytes(raw: bytes) -> tuple[str, bool]:
    """Return (text, used_fallback)."""
    try:
        # utf-8-sig strips a leading file BOM and also decodes BOM-less UTF-8
        return raw.decode("utf-8-sig"), False
    except UnicodeDecodeError:
        return raw.decode("latin-1"), True  # never raises; flag it

def clean_cell(s: str) -> str:
    s = unicodedata.normalize("NFC", s)
    # decode_bytes already strips a leading file BOM via utf-8-sig; this
    # only removes a stray BOM (U+FEFF) that survived mid-stream.
    return s.replace("\ufeff", "").strip()
```

### The parser core

```python
@dataclass
class ParseReport:
    total: int = 0   # data rows seen (header NOT counted here)
    ok: int = 0
    blanks: int = 0
    rejected: list[tuple[int, str, str]] = field(default_factory=list)  # (lineno, reason, raw)
    encoding_fallback: bool = False

REQUIRED = ("id", "text", "label")

def fields_to_str(fields: list[str]) -> str:
    return "\t".join(fields)[:200]  # truncate so the log doesn't explode

def parse(raw: bytes, delimiter: str = "\t") -> tuple[list[dict], ParseReport]:
    text, fallback = decode_bytes(raw)
    if fallback:
        logger.warning("input is not valid UTF-8; decoded as latin-1")
    report = ParseReport(encoding_fallback=fallback)
    rows: list[dict] = []
    reader = csv.reader(
        io.StringIO(text, newline=""),  # let csv own newline handling
        delimiter=delimiter,
        quotechar='"',
        doublequote=True,       # "" -> " inside quoted fields
        skipinitialspace=True,
    )
    header = None
    for fields in reader:
        # line_num counts physical lines, so it stays right when a quoted
        # field spans several lines (a plain enumerate() would drift)
        lineno = reader.line_num
        if not fields or all(c.strip() == "" for c in fields):
            report.blanks += 1
            continue                      # blank line: expected noise, skip
        if header is None:                # first non-blank row is the header
            header = [clean_cell(c) for c in fields]
            missing = [c for c in REQUIRED if c not in header]
            if missing:
                raise ValueError(f"header missing required columns: {missing}")
            continue
        # Assumed policy for data rows (a sketch): quarantine with a reason
        # rather than drop or repair.
        report.total += 1
        if len(fields) != len(header):
            reason = f"expected {len(header)} fields, got {len(fields)}"
            report.rejected.append((lineno, reason, fields_to_str(fields)))
            continue
        record = dict(zip(header, (clean_cell(c) for c in fields)))
        empty = [c for c in REQUIRED if not record[c]]
        if empty:
            report.rejected.append((lineno, f"empty required fields: {empty}", fields_to_str(fields)))
            continue
        rows.append(record)
        report.ok += 1
    return rows, report
```

What difficulty level is this interview question?

This is a Medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Scale AI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Scale AI during technical interviews.

Debug ML pipeline and build text parser | Scale AI Interview Question

Q: Debug ML pipeline and build text parser

This question evaluates practical machine learning engineering skills, including debugging unfamiliar codebases, parsing noisy text data, and identifying subtle defects such as train/test leakage or shape mismatches. It tests applied Python proficiency and systematic reasoning under time pressure, core competencies assessed in ML engineer technical interviews.

You are in a hands-on, hour-long ML-engineering working session (Scale AI, Machine Learning Engineer loop). You are given a small ML project — data loading, preprocessing, training, evaluation — that is supposed to train a text classifier/sequence model, but the data is dirty and the code has defects. The interviewer is watching how you work, not just the final diff: how you reason about messy input, how fast you locate bugs in unfamiliar code, and how you prove a fix is correct under a clock.

The session has three parts, in order.

Constraints & Assumptions

Time budget: the whole working session is ~60 minutes. Part 3 explicitly asks you to validate within that limit, so you must spend the clock deliberately.
Codebase: the provided ML project is "a lot of code" you have not seen before, so you cannot read it top-to-bottom in time.
Data: the raw text files are line-oriented but noisy. Inconsistent delimiters, quoting/escaping, mixed/unknown encodings, missing fields, and malformed lines all appear.
Defect count: exactly three defects are planted in the project (Part 2). They are drawn from a known menu: index off-by-one in tokenization, train/test leakage, incorrect loss reduction, nondeterministic seeding, or shape mismatches.
Stack: assume a standard Python (PyTorch-style) ML stack; you may use the standard library and common libraries.

Clarifying Questions to Ask

Scope the whole session up front with the interviewer before writing code:

What is the exact record schema the parser should emit (field names, required vs. optional, types), and what is the delimiter/quoting convention of the input files?
For malformed or missing-field lines, do you want them dropped, quarantined with a reason, or repaired? Is silently losing rows acceptable?
For the ML project: is there a known failing symptom (a crash, a metric that looks too good, non-reproducible results), or do I discover the three defects cold?
Is there a ground-truth/expected metric range I should validate against, and is a tiny subset of the data available for fast iteration?
Do the three defects need to be fixed with minimal diffs (no refactor), and should I add tests, or just demonstrate the fixes?

Part 1 — Build a robust text parser

Given the raw, noisily-formatted text files, implement a parser that turns them into structured examples. It must handle: configurable delimiters, quoting/escaping, encodings and Unicode, missing fields, and malformed lines — and must not silently discard data. Then describe how you would test it.

What This Part Should Cover

Correct handling of quoting/escaping and embedded delimiters/newlines (the case a naive split breaks on), plus mixed line endings and trailing delimiters.
A deliberate encoding/Unicode strategy with an observable fallback, not a silent guess.
An explicit malformed-line and missing-field policy (quarantine-with-reason, never crash, never silently drop).
A layered test plan: unit fixtures per edge case (asserting the report, not just the records), round-trip/property fuzzing, a golden file, and adversarial random-bytes input.
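
The round-trip fuzzing item in the test plan can be sketched with the standard library alone. This is a minimal illustration, not the project's test suite: `random_cell`, `roundtrip_ok`, `fuzz`, and the `AWKWARD` alphabet are hypothetical names and choices made for this example. It writes random rows with `csv.writer`, parses them back with `csv.reader`, and checks that nothing changed, including cells with embedded delimiters, quotes, and newlines.

```python
import csv
import io
import random

# Hypothetical alphabet: characters that a naive str.split() parser breaks on.
AWKWARD = ["a", "b", " ", "\t", ",", '"', "\n", "é", "日"]

def random_cell(rng: random.Random, max_len: int = 6) -> str:
    """Build a short random cell from the awkward alphabet."""
    return "".join(rng.choice(AWKWARD) for _ in range(rng.randint(0, max_len)))

def roundtrip_ok(rows: list[list[str]], delimiter: str = "\t") -> bool:
    """Serialize rows with csv.writer, parse them back, and compare."""
    buf = io.StringIO(newline="")
    csv.writer(buf, delimiter=delimiter).writerows(rows)
    source = io.StringIO(buf.getvalue(), newline="")
    parsed = list(csv.reader(source, delimiter=delimiter))
    return parsed == rows

def fuzz(n_cases: int = 200, seed: int = 0) -> int:
    """Return the number of random tables that fail the round trip."""
    rng = random.Random(seed)  # fixed seed so any failure is reproducible
    failures = 0
    for _ in range(n_cases):
        n_cols = rng.randint(1, 4)
        n_rows = rng.randint(1, 5)
        rows = [[random_cell(rng) for _ in range(n_cols)] for _ in range(n_rows)]
        if not roundtrip_ok(rows):
            failures += 1
    return failures

if __name__ == "__main__":
    print("failing cases:", fuzz())
```

The same harness becomes more useful when `roundtrip_ok` is pointed at a hand-rolled parser instead of `csv.reader`, since that is where quoting bugs tend to surface.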

Part 2 — Find and fix three defects fast

In the provided ML project (load → preprocess/tokenize → batch → train → eval), identify and fix three defects, then explain your rapid-debugging approach. There is a lot of unfamiliar code, so the skill being measured is locating the problems quickly, not reading everything.

What This Part Should Cover

A repeatable triage method: reproduce first, read the trace bottom-up into project code, bisect with invariants, build a minimal repro for fast iteration.
For each defect: the observable symptom, the root cause, and a minimal correct fix (no gratuitous refactor).
Awareness that some planted bugs are silent and require proactive sanity checks rather than waiting for a crash.
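
Silent defects such as an incorrect loss reduction never raise, so they need an explicit check. The sketch below uses plain Python rather than the project's framework; `cross_entropy`, its `reduction` argument, and the toy sizes are assumptions for illustration. It checks two invariants: uniform logits over V classes should give a loss of ln(V), and a mean-reduced loss should not change with batch size while a sum-reduced one scales with it.

```python
import math

def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one example's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cross_entropy(batch_logits: list[list[float]], targets: list[int],
                  reduction: str = "mean") -> float:
    """Cross-entropy over a batch with an explicit reduction."""
    losses = [-log_softmax(logits)[t] for logits, t in zip(batch_logits, targets)]
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    raise ValueError(f"unknown reduction: {reduction!r}")

V = 10                      # number of classes (toy value)
uniform = [0.0] * V         # an untrained model is roughly uniform

small = cross_entropy([uniform] * 4, [0, 1, 2, 3])
large = cross_entropy([uniform] * 64, [i % V for i in range(64)])

# Invariant 1: initial loss should be close to ln(V).
assert math.isclose(small, math.log(V))
# Invariant 2: mean reduction is independent of batch size.
assert math.isclose(small, large)

# A sum reduction grows with the batch, which silently rescales gradients.
sum_small = cross_entropy([uniform] * 4, [0, 1, 2, 3], reduction="sum")
sum_large = cross_entropy([uniform] * 64, [i % V for i in range(64)], reduction="sum")
assert math.isclose(sum_large / sum_small, 16.0)
```

If the project's loss at step 0 is far from ln(V), or changes when only the batch size changes, that points at the loss code before any training time is spent.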

Part 3 — Validate the fixes in 60 minutes

Describe how you would confirm the three fixes are correct within the time limit, using unit tests, an end-to-end run, metrics sanity checks, and regression guards.

What This Part Should Cover

A concrete time allocation across triage, fixing, and validation that fits 60 minutes.
One targeted test per fix, each demonstrably failing on the buggy version.
An end-to-end smoke run with explicit success criteria (completes, loss decreases, overfits a tiny batch) and metric sanity checks against a baseline.
Regression guards (determinism check, leakage assert, kept unit tests) plus honesty about anything left unvalidated.
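
The determinism check and leakage assert can be kept as small helpers that run on every change. A minimal sketch under stated assumptions: `split_ids`, `toy_train`, `assert_no_leakage`, and `assert_deterministic` are hypothetical names, and `toy_train` is only a stand-in for the project's real training entry point. In a PyTorch-style project the seeding would also need to cover the framework's own random number generators.

```python
import random

def assert_no_leakage(train_ids, test_ids) -> None:
    """Fail loudly if any example id appears in both splits."""
    overlap = set(train_ids) & set(test_ids)
    assert not overlap, f"{len(overlap)} ids appear in both splits"

def split_ids(ids, test_fraction: float, seed: int):
    """Deterministic split: shuffle a sorted copy with a local RNG, then cut."""
    rng = random.Random(seed)   # local RNG, independent of global state
    shuffled = sorted(ids)      # sort first so input order does not matter
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def toy_train(seed: int, steps: int = 5) -> list[float]:
    """Stand-in for a training run: returns a per-step 'loss' list."""
    rng = random.Random(seed)
    return [1.0 / (step + 1) + rng.random() * 0.01 for step in range(steps)]

def assert_deterministic(run, seed: int) -> None:
    """Run twice with the same seed and require identical results."""
    first, second = run(seed), run(seed)
    assert first == second, "same seed produced different results"

if __name__ == "__main__":
    train, test = split_ids(range(100), test_fraction=0.2, seed=0)
    assert_no_leakage(train, test)
    assert_deterministic(toy_train, seed=0)
    print(len(train), len(test))
```

Each guard is cheap enough to stay in the test suite after the session, which is what turns a one-off fix into a regression guard.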

What a Strong Answer Covers

Across all three parts, the interviewer is reading for engineering judgment under time pressure, not perfect code:

Defensive, observable engineering: input/output contracts stated before coding, no silent data loss, invariants asserted rather than eyeballed.
Speed of iteration: minimal repros, bisection, and tiny subsets so each experiment runs in seconds; the candidate optimizes total fix time, not line-by-line reading.
Distinguishing crashes from silent wrongness and proving correctness with tests that fail on the bug, including explaining why a leakage fix legitimately lowers validation accuracy.
Communication: verbalizing trade-offs, time allocation, and residual risk, treating the session as collaboration, not a silent coding exam.

Follow-up Questions

Your parser quarantines malformed lines. The rejection rate jumps from 0.5% to 12% overnight in production. How do you alert on this, and how do you decide whether it's a data-source change or a parser regression?
You fixed train/test leakage and validation accuracy dropped from 0.97 to 0.81. How do you convince a skeptical stakeholder this is an improvement, not a regression?
The three bugs included a silent one you found only via the $\ln(V)$ loss check. How would you build a standing "tripwire" suite so silent ML bugs (leakage, wrong reduction, nondeterminism) are caught in CI rather than in a debugging session?
For very large input files that don't fit in memory, how would you restructure the parser, and what changes in your testing strategy?
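
For the large-file question, one possible direction is a generator that never holds more than one row in memory. This is a minimal sketch, assuming UTF-8 input where undecodable bytes are replaced with U+FFFD rather than triggering a whole-file fallback; `iter_rows` and its parameters are hypothetical names, not part of any provided code.

```python
import csv
import io
from typing import BinaryIO, Iterator

def iter_rows(stream: BinaryIO, delimiter: str = "\t") -> Iterator[tuple[int, list[str]]]:
    """Lazily yield (line_number, fields) from a binary stream."""
    text = io.TextIOWrapper(
        stream,
        encoding="utf-8-sig",   # strips a leading BOM if present
        errors="replace",       # assumption: replace bad bytes, do not abort
        newline="",             # let csv own newline handling
    )
    reader = csv.reader(text, delimiter=delimiter)
    try:
        for fields in reader:
            if not fields:      # csv yields [] for a blank line
                continue
            # line_num is the physical line where this row ended
            yield reader.line_num, fields
    finally:
        text.detach()           # leave the underlying stream open for the caller

if __name__ == "__main__":
    data = b'\xef\xbb\xbfid\ttext\n1\t"a\nb"\n\n2\tbad\xff\n'
    for lineno, fields in iter_rows(io.BytesIO(data)):
        print(lineno, fields)
```

With a real file the caller would pass `open(path, "rb")`. Validation and quarantine counters then have to be updated row by row, and tests shift toward checking that memory use stays flat and that results match the in-memory parser on small inputs.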
