PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/Software Engineering Fundamentals/Tesla

Build a Transaction CSV Cleaning Pipeline

Last updated: Jun 17, 2026

Quick Overview

This question evaluates skills in data engineering and software engineering fundamentals, focusing on designing deterministic, auditable batch ETL/cleaning pipelines, data parsing, identifier preservation, numeric precision, and post-cleaning validation.

  • medium
  • Tesla
  • Software Engineering Fundamentals
  • Data Engineer

Build a Transaction CSV Cleaning Pipeline

Company: Tesla

Role: Data Engineer

Category: Software Engineering Fundamentals

Difficulty: medium

Interview Round: Technical Screen

You receive a raw CSV file containing customer transaction data with the columns `date`, `amount`, `customer_id`, `payment_method`, and `notes`. The file has several known data-quality problems: - **`date`** appears in mixed formats within the same column, such as `2024-01-05` (`YYYY-MM-DD`), `01/05/2024` (`MM/DD/YYYY`), and `Jan 5, 2024`. - **`amount`** contains currency symbols and thousands separators, such as `$1,234.56`; some values may be parenthesized or carry a trailing minus to indicate refunds. - **`customer_id`** should preserve leading zeros (e.g. `00042`), but some IDs have had them stripped by spreadsheet software that interpreted the field as a number (so `00042` arrives as `42`). - **`payment_method`** contains misspellings and inconsistent capitalization (e.g. `Credit Card`, `creditcard`, `cred card`). Design a cleaning pipeline that handles these issues, then explain what validations you would run **after** cleaning to confirm the data is usable for downstream analytics. ### Constraints & Assumptions - This is a **batch** pipeline: a new file arrives periodically (e.g. daily) and is processed end to end. - The CSV is large enough that you should not assume manual, row-by-row inspection — corrections must be rule-driven and reproducible. - The cleaned output feeds downstream analytics (revenue reporting, customer-level aggregation), so correctness and auditability matter more than raw throughput. - You may assume access to supporting metadata where reasonable (e.g. a customer reference table, a canonical list of valid payment methods) — but you should state whenever your approach depends on such a resource, and what you do if it is unavailable. ```hint Where to start Think in ordered **stages** rather than one in-place pass over a single dataframe. Cleaning should be a sequence of deliberate steps, not implicit coercion as a side effect of loading — decide what the boundaries between those steps are and what each one is responsible for. ``` ```hint Watch the loader's defaults The `customer_id` problem can be re-created by your own pipeline: if the loader infers column types, it may re-strip the leading zeros you are trying to recover. Think about *when* each column first acquires a type, and whether the default inference works for or against you here. ``` ```hint Determinism over guessing For ambiguous cases — an unparseable date, a `payment_method` too far from any known value, a `customer_id` with no width rule — prefer **flag-and-quarantine** over silently guessing. Use an explicit list of accepted date formats and a curated synonym map rather than a single permissive "smart" parser. ``` ```hint Money precision `amount` feeds revenue reporting, so cleaned totals must reconcile exactly against the source. Be deliberate about which numeric representation you parse into — the obvious default for "a number with decimals" can reintroduce rounding error that downstream reconciliation will surface. ``` ### Clarifying Questions to Ask - Is there a **fixed expected width** for `customer_id` (so leading zeros can be restored by left-padding), or a customer reference table to validate against? - What is the **canonical set** of valid `payment_method` values, and is there an approved synonym/misspelling dictionary? - How should **negative or zero amounts** be treated — legitimate refunds/adjustments, or invalid rows to reject? - For dates, is the source **timezone** known, and are dates expected to fall within a particular range (e.g. no future dates)? - What is the policy for **bad rows** — drop them, quarantine them for review, or block the whole file if too many fail? - Are there **uniqueness/business-key** expectations (e.g. one row per transaction id) that should drive duplicate detection? ### What a Strong Answer Covers - **The overall shape of the pipeline**: how the candidate decomposes the work and where the raw input fits, versus collapsing everything into one tangled pass. - **How `customer_id` is defended against the loader** so the very coercion that damaged the data isn't repeated by the pipeline that's meant to fix it. - **How `date` ambiguity is resolved** when the same column carries multiple formats, and what happens to values that don't fit any of them. - **How `amount` is made to reconcile** so cleaned revenue totals tie back to the source, including a defensible rule for signs (refunds) and zeros. - **How `payment_method` is canonicalized** without over-eagerly "correcting" values that are actually unknown. - **What happens to rows that can't be cleaned** — whether bad data is recoverable, traceable, and kept out of the clean stream rather than dropped silently. - **Auditability and reproducibility**: whether a run can be explained and repeated, and what lineage the output carries. - **The post-cleaning validation strategy**: which checks would actually catch a corrupted or regressed file before it reaches a report. ### Follow-up Questions - How would you make the pipeline **incremental and re-runnable** so that reprocessing the same file does not create duplicates or double-count revenue? - If 30% of rows suddenly fail date parsing one day, how would your pipeline **detect and respond** to that regression before it corrupts downstream reports? - How would you evolve this from a one-off script into a **monitored, scheduled** pipeline (orchestration, alerting on data-quality SLAs, schema-drift handling)? - How would your design change if the same data arrived as a **continuous stream** rather than a daily batch?

Quick Answer: This question evaluates skills in data engineering and software engineering fundamentals, focusing on designing deterministic, auditable batch ETL/cleaning pipelines, data parsing, identifier preservation, numeric precision, and post-cleaning validation.

Related Interview Questions

  • Explain reliability and runtime fundamentals - Tesla (medium)
Tesla logo
Tesla
May 4, 2026, 12:00 AM
Data Engineer
Technical Screen
Software Engineering Fundamentals
24
0

You receive a raw CSV file containing customer transaction data with the columns date, amount, customer_id, payment_method, and notes.

The file has several known data-quality problems:

  • date appears in mixed formats within the same column, such as 2024-01-05 ( YYYY-MM-DD ), 01/05/2024 ( MM/DD/YYYY ), and Jan 5, 2024 .
  • amount contains currency symbols and thousands separators, such as $1,234.56 ; some values may be parenthesized or carry a trailing minus to indicate refunds.
  • customer_id should preserve leading zeros (e.g. 00042 ), but some IDs have had them stripped by spreadsheet software that interpreted the field as a number (so 00042 arrives as 42 ).
  • payment_method contains misspellings and inconsistent capitalization (e.g. Credit Card , creditcard , cred card ).

Design a cleaning pipeline that handles these issues, then explain what validations you would run after cleaning to confirm the data is usable for downstream analytics.

Constraints & Assumptions

  • This is a batch pipeline: a new file arrives periodically (e.g. daily) and is processed end to end.
  • The CSV is large enough that you should not assume manual, row-by-row inspection — corrections must be rule-driven and reproducible.
  • The cleaned output feeds downstream analytics (revenue reporting, customer-level aggregation), so correctness and auditability matter more than raw throughput.
  • You may assume access to supporting metadata where reasonable (e.g. a customer reference table, a canonical list of valid payment methods) — but you should state whenever your approach depends on such a resource, and what you do if it is unavailable.

Clarifying Questions to Ask

  • Is there a fixed expected width for customer_id (so leading zeros can be restored by left-padding), or a customer reference table to validate against?
  • What is the canonical set of valid payment_method values, and is there an approved synonym/misspelling dictionary?
  • How should negative or zero amounts be treated — legitimate refunds/adjustments, or invalid rows to reject?
  • For dates, is the source timezone known, and are dates expected to fall within a particular range (e.g. no future dates)?
  • What is the policy for bad rows — drop them, quarantine them for review, or block the whole file if too many fail?
  • Are there uniqueness/business-key expectations (e.g. one row per transaction id) that should drive duplicate detection?

What a Strong Answer Covers

  • The overall shape of the pipeline : how the candidate decomposes the work and where the raw input fits, versus collapsing everything into one tangled pass.
  • How customer_id is defended against the loader so the very coercion that damaged the data isn't repeated by the pipeline that's meant to fix it.
  • How date ambiguity is resolved when the same column carries multiple formats, and what happens to values that don't fit any of them.
  • How amount is made to reconcile so cleaned revenue totals tie back to the source, including a defensible rule for signs (refunds) and zeros.
  • How payment_method is canonicalized without over-eagerly "correcting" values that are actually unknown.
  • What happens to rows that can't be cleaned — whether bad data is recoverable, traceable, and kept out of the clean stream rather than dropped silently.
  • Auditability and reproducibility : whether a run can be explained and repeated, and what lineage the output carries.
  • The post-cleaning validation strategy : which checks would actually catch a corrupted or regressed file before it reaches a report.

Follow-up Questions

  • How would you make the pipeline incremental and re-runnable so that reprocessing the same file does not create duplicates or double-count revenue?
  • If 30% of rows suddenly fail date parsing one day, how would your pipeline detect and respond to that regression before it corrupts downstream reports?
  • How would you evolve this from a one-off script into a monitored, scheduled pipeline (orchestration, alerting on data-quality SLAs, schema-drift handling)?
  • How would your design change if the same data arrived as a continuous stream rather than a daily batch?

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More Software Engineering Fundamentals•More Tesla•More Data Engineer•Tesla Data Engineer•Tesla Software Engineering Fundamentals•Data Engineer Software Engineering Fundamentals
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.