Build a Transaction CSV Cleaning Pipeline
Company: Tesla
Role: Data Engineer
Category: Software Engineering Fundamentals
Difficulty: medium
Interview Round: Technical Screen
You receive a raw CSV file containing customer transaction data with the columns `date`, `amount`, `customer_id`, `payment_method`, and `notes`.
The file has several known data-quality problems:
- **`date`** appears in mixed formats within the same column, such as `2024-01-05` (`YYYY-MM-DD`), `01/05/2024` (`MM/DD/YYYY`), and `Jan 5, 2024`.
- **`amount`** contains currency symbols and thousands separators, such as `$1,234.56`; some values may be parenthesized or carry a trailing minus to indicate refunds.
- **`customer_id`** should preserve leading zeros (e.g. `00042`), but some IDs have had them stripped by spreadsheet software that interpreted the field as a number (so `00042` arrives as `42`).
- **`payment_method`** contains misspellings and inconsistent capitalization (e.g. `Credit Card`, `creditcard`, `cred card`).
Design a cleaning pipeline that handles these issues, then explain what validations you would run **after** cleaning to confirm the data is usable for downstream analytics.
### Constraints & Assumptions
- This is a **batch** pipeline: a new file arrives periodically (e.g. daily) and is processed end to end.
- The CSV is large enough that you should not assume manual, row-by-row inspection — corrections must be rule-driven and reproducible.
- The cleaned output feeds downstream analytics (revenue reporting, customer-level aggregation), so correctness and auditability matter more than raw throughput.
- You may assume access to supporting metadata where reasonable (e.g. a customer reference table, a canonical list of valid payment methods) — but you should state whenever your approach depends on such a resource, and what you do if it is unavailable.
```hint Where to start
Think in ordered **stages** rather than one in-place pass over a single dataframe. Cleaning should be a sequence of deliberate steps, not implicit coercion as a side effect of loading — decide what the boundaries between those steps are and what each one is responsible for.
```
```hint Watch the loader's defaults
The `customer_id` problem can be re-created by your own pipeline: if the loader infers column types, it may re-strip the leading zeros you are trying to recover. Think about *when* each column first acquires a type, and whether the default inference works for or against you here.
```
```hint Determinism over guessing
For ambiguous cases — an unparseable date, a `payment_method` too far from any known value, a `customer_id` with no width rule — prefer **flag-and-quarantine** over silently guessing. Use an explicit list of accepted date formats and a curated synonym map rather than a single permissive "smart" parser.
```
```hint Money precision
`amount` feeds revenue reporting, so cleaned totals must reconcile exactly against the source. Be deliberate about which numeric representation you parse into — the obvious default for "a number with decimals" can reintroduce rounding error that downstream reconciliation will surface.
```
### Clarifying Questions to Ask
- Is there a **fixed expected width** for `customer_id` (so leading zeros can be restored by left-padding), or a customer reference table to validate against?
- What is the **canonical set** of valid `payment_method` values, and is there an approved synonym/misspelling dictionary?
- How should **negative or zero amounts** be treated — legitimate refunds/adjustments, or invalid rows to reject?
- For dates, is the source **timezone** known, and are dates expected to fall within a particular range (e.g. no future dates)?
- What is the policy for **bad rows** — drop them, quarantine them for review, or block the whole file if too many fail?
- Are there **uniqueness/business-key** expectations (e.g. one row per transaction id) that should drive duplicate detection?
### What a Strong Answer Covers
- **The overall shape of the pipeline**: how the candidate decomposes the work and where the raw input fits, versus collapsing everything into one tangled pass.
- **How `customer_id` is defended against the loader** so the very coercion that damaged the data isn't repeated by the pipeline that's meant to fix it.
- **How `date` ambiguity is resolved** when the same column carries multiple formats, and what happens to values that don't fit any of them.
- **How `amount` is made to reconcile** so cleaned revenue totals tie back to the source, including a defensible rule for signs (refunds) and zeros.
- **How `payment_method` is canonicalized** without over-eagerly "correcting" values that are actually unknown.
- **What happens to rows that can't be cleaned** — whether bad data is recoverable, traceable, and kept out of the clean stream rather than dropped silently.
- **Auditability and reproducibility**: whether a run can be explained and repeated, and what lineage the output carries.
- **The post-cleaning validation strategy**: which checks would actually catch a corrupted or regressed file before it reaches a report.
### Follow-up Questions
- How would you make the pipeline **incremental and re-runnable** so that reprocessing the same file does not create duplicates or double-count revenue?
- If 30% of rows suddenly fail date parsing one day, how would your pipeline **detect and respond** to that regression before it corrupts downstream reports?
- How would you evolve this from a one-off script into a **monitored, scheduled** pipeline (orchestration, alerting on data-quality SLAs, schema-drift handling)?
- How would your design change if the same data arrived as a **continuous stream** rather than a daily batch?
Quick Answer: This question evaluates skills in data engineering and software engineering fundamentals, focusing on designing deterministic, auditable batch ETL/cleaning pipelines, data parsing, identifier preservation, numeric precision, and post-cleaning validation.