You receive four CSV files that must be merged locally on a laptop with 8 GB RAM, without relying on cloud services:
- products.csv: product_id, category, product_name
- purchases.csv: purchase_id, product_id, price, stars
- ads_events.csv: date (YYYY-MM-DD), user_id, ad_id, watch_seconds, clicks
- categories.csv: category, category_display_name
Constraints and quirks: the files may use different delimiters (comma or semicolon) and encodings (UTF-8 or Windows-1252), may contain duplicate header rows and stray BOMs, and purchases.csv can exceed 10 million rows.

Task: produce two outputs.
(A) category_min_price.csv: for every category in products.csv, the lowest price among purchases with stars > 4, or 0 if there are none.
(B) weekly_watchtime.csv: for every (ISOYear, ISOWeek, category), the total watch_seconds from ads_events.csv joined to products. ads_events.csv carries no product_id column, so the ad_id → product_id mapping comes from a lookup file, ad_product_map.csv (columns: ad_id, product_id), provided at runtime.

Describe and then implement (in pseudocode or Python) a robust merge pipeline that:
- detects and normalizes encodings and delimiters;
- streams large files in chunks;
- de-duplicates rows by their natural keys;
- handles missing/NULL stars;
- computes ISO weeks correctly (state the formula or library call);
- avoids out-of-memory errors (e.g., chunked pandas with an on-disk store, or SQLite/DuckDB with CREATE TABLE AS SELECT);
- validates row counts and key coverage;
- writes both outputs sorted and reproducible.

State the exact join keys and the data types you will enforce, and explain how you would unit test correctness on a 1000-row synthetic sample. Illustrative sketches of several of these steps follow.
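For the encoding and delimiter quirks, one workable approach is to sniff each file from a byte sample before parsing. A minimal standard-library sketch, assuming UTF-8 and Windows-1252 are the only encodings in play and that a 64 KB sample is representative (function names are illustrative):

```python
import csv

def sniff_encoding_and_delimiter(path, sample_bytes=64 * 1024):
    """Guess encoding (UTF-8 vs Windows-1252) and delimiter (',' vs ';') from a sample."""
    with open(path, "rb") as fh:
        raw = fh.read(sample_bytes)
    try:
        sample = raw.decode("utf-8-sig")   # utf-8-sig transparently strips a BOM
        encoding = "utf-8-sig"
    except UnicodeDecodeError:
        sample = raw.decode("cp1252")      # Windows-1252 fallback
        encoding = "cp1252"
    delimiter = csv.Sniffer().sniff(sample, delimiters=",;").delimiter
    return encoding, delimiter

def stream_rows(path):
    """Yield rows as dicts, skipping duplicate header rows embedded in the data."""
    encoding, delimiter = sniff_encoding_and_delimiter(path)
    with open(path, newline="", encoding=encoding) as fh:
        reader = csv.reader(fh, delimiter=delimiter)
        header = [h.strip().lstrip("\ufeff") for h in next(reader)]
        for row in reader:
            if [c.strip() for c in row] == header:   # repeated header row inside the data
                continue
            yield dict(zip(header, row))
```

A reasonable first pipeline stage is to push every raw file through this reader once and write a normalized UTF-8, comma-delimited copy (e.g. purchases_norm.csv); the SQL sketch further down assumes such copies exist.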
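For ISO weeks, the standard library already implements the ISO-8601 rules, so a library call is safer than a hand-rolled formula:

```python
from datetime import date

def iso_year_week(ymd: str) -> tuple[int, int]:
    """Map a 'YYYY-MM-DD' string to (ISOYear, ISOWeek) per ISO 8601."""
    iso = date.fromisoformat(ymd).isocalendar()   # (ISO year, ISO week, ISO weekday)
    return iso[0], iso[1]

# 2021-01-01 falls in ISO week 53 of ISO year 2020, not week 1 of 2021.
assert iso_year_week("2021-01-01") == (2020, 53)
```

In DuckDB SQL, the isoyear and week date parts follow the same ISO-8601 numbering, which is what the sketch below relies on.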
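To stay within 8 GB, one option named in the prompt is DuckDB with CREATE TABLE AS SELECT: the database lives on disk and the CSV readers stream, so the full purchases table never has to fit in memory. A hedged sketch, assuming the normalized *_norm.csv copies described above (file names are illustrative, and de-duplication of the large fact tables by their natural keys, purchase_id and (date, user_id, ad_id), is omitted for brevity):

```python
import duckdb

con = duckdb.connect("pipeline.duckdb")   # on-disk database, not an in-memory one

# Dimension tables: de-duplicate on the natural key and enforce types up front.
con.execute("""
    CREATE TABLE products AS
    SELECT DISTINCT CAST(product_id AS BIGINT) AS product_id,
                    CAST(category AS VARCHAR)  AS category
    FROM read_csv('products_norm.csv', header = true);
""")
con.execute("""
    CREATE TABLE ad_product_map AS
    SELECT DISTINCT CAST(ad_id AS BIGINT)      AS ad_id,
                    CAST(product_id AS BIGINT) AS product_id
    FROM read_csv('ad_product_map_norm.csv', header = true);
""")

# (A) Lowest qualifying price per category. Join key: products.product_id = purchases.product_id.
#     NULL stars never satisfy stars > 4; COALESCE supplies the required 0 for empty groups.
con.execute("""
    CREATE TABLE category_min_price AS
    SELECT p.category,
           COALESCE(MIN(pu.price) FILTER (WHERE pu.stars > 4), 0) AS min_price
    FROM products p
    LEFT JOIN read_csv('purchases_norm.csv', header = true,
                       columns = {'purchase_id': 'BIGINT', 'product_id': 'BIGINT',
                                  'price': 'DOUBLE', 'stars': 'DOUBLE'}) pu
           ON pu.product_id = p.product_id
    GROUP BY p.category;
""")

# (B) Weekly watch time. Join keys: ads_events.ad_id = ad_product_map.ad_id,
#     then ad_product_map.product_id = products.product_id.
con.execute("""
    CREATE TABLE weekly_watchtime AS
    SELECT date_part('isoyear', a.date) AS iso_year,
           date_part('week', a.date)    AS iso_week,
           p.category,
           SUM(a.watch_seconds)         AS total_watch_seconds
    FROM read_csv('ads_events_norm.csv', header = true,
                  columns = {'date': 'DATE', 'user_id': 'BIGINT', 'ad_id': 'BIGINT',
                             'watch_seconds': 'BIGINT', 'clicks': 'BIGINT'}) a
    JOIN ad_product_map m ON m.ad_id = a.ad_id
    JOIN products p       ON p.product_id = m.product_id
    GROUP BY ALL;
""")

# Deterministic, sorted outputs.
con.execute("COPY (SELECT * FROM category_min_price ORDER BY category) "
            "TO 'category_min_price.csv' (HEADER, DELIMITER ',');")
con.execute("COPY (SELECT * FROM weekly_watchtime ORDER BY iso_year, iso_week, category) "
            "TO 'weekly_watchtime.csv' (HEADER, DELIMITER ',');")
```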
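Row-count and key-coverage validation can run against the same on-disk database before the COPY statements; the checks below are illustrative rather than exhaustive:

```python
import duckdb

con = duckdb.connect("pipeline.duckdb")   # reuse the database created by the pipeline

# Output (A) must contain exactly one row per category present in products.csv.
n_categories = con.execute("SELECT count(DISTINCT category) FROM products").fetchone()[0]
n_rows_a = con.execute("SELECT count(*) FROM category_min_price").fetchone()[0]
assert n_rows_a == n_categories, "category_min_price must cover every category"

# Key coverage: ad_ids in ads_events without an entry in ad_product_map lose their
# watch_seconds in output (B); report them rather than failing silently.
unmapped = con.execute("""
    SELECT count(DISTINCT a.ad_id)
    FROM read_csv('ads_events_norm.csv', header = true) a
    LEFT JOIN ad_product_map m ON m.ad_id = a.ad_id
    WHERE m.ad_id IS NULL
""").fetchone()[0]
print(f"ad_ids without a product mapping: {unmapped}")
```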
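If DuckDB is not available, the minimum-price computation also fits the chunked-pandas route the prompt mentions: stream purchases.csv a million rows at a time and keep only one running minimum per product, which is small enough to hold in memory. A sketch under that assumption (chunk size and dtypes are choices, not requirements):

```python
import pandas as pd

def category_min_price_chunked(purchases_path, products_path, chunksize=1_000_000):
    """Output (A) via chunked pandas: running per-product minimum over stars > 4 rows."""
    products = pd.read_csv(products_path,
                           usecols=["product_id", "category"],
                           dtype={"product_id": "int64", "category": "string"})
    running_min: dict[int, float] = {}   # product_id -> lowest qualifying price so far
    for chunk in pd.read_csv(purchases_path, chunksize=chunksize,
                             usecols=["product_id", "price", "stars"]):
        chunk["stars"] = pd.to_numeric(chunk["stars"], errors="coerce")   # NULL/garbage -> NaN
        chunk["price"] = pd.to_numeric(chunk["price"], errors="coerce")
        qualifying = chunk[chunk["stars"] > 4]                            # NaN never passes >
        for pid, price in qualifying.groupby("product_id")["price"].min().items():
            running_min[pid] = min(price, running_min.get(pid, float("inf")))
    per_product = pd.Series(running_min, name="min_price")
    return (products.merge(per_product, how="left",
                           left_on="product_id", right_index=True)
                    .groupby("category")["min_price"].min()
                    .fillna(0)
                    .sort_index())
```

Output (B) can follow the same pattern: merge each ads_events chunk against the small ad_product_map and products frames and accumulate sums keyed by (iso_year, iso_week, category).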
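For the 1000-row synthetic test, the useful trick is to plant rows whose correct answer is known by construction (an extreme price with stars > 4, a cheaper one with low stars, a NULL-stars row) and assert the pipeline reproduces it. The fragment below checks the core output-(A) rules on a handful of hand-written rows; generating 1000 rows around the same planted extremes is a straightforward extension (the category names are made up for the test):

```python
import duckdb

def test_min_price_rules():
    con = duckdb.connect()   # throwaway in-memory database for the test
    con.execute("CREATE TABLE products (product_id BIGINT, category VARCHAR)")
    con.execute("INSERT INTO products VALUES (1, 'toys'), (2, 'books'), (3, 'games')")
    con.execute("CREATE TABLE purchases (purchase_id BIGINT, product_id BIGINT, "
                "price DOUBLE, stars DOUBLE)")
    con.execute("""
        INSERT INTO purchases VALUES
            (10, 1, 9.99, 5),      -- qualifies: should win for 'toys'
            (11, 1, 0.50, 2),      -- cheaper but stars <= 4: must be ignored
            (12, 2, 3.00, NULL),   -- NULL stars never satisfies stars > 4
            (13, 3, 7.00, 4.5)     -- qualifies for 'games'
    """)
    rows = con.execute("""
        SELECT p.category,
               COALESCE(MIN(pu.price) FILTER (WHERE pu.stars > 4), 0) AS min_price
        FROM products p
        LEFT JOIN purchases pu ON pu.product_id = p.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
    assert rows == [("books", 0.0), ("games", 7.0), ("toys", 9.99)]

test_min_price_rules()
```

An analogous test for output (B) would plant ads_events rows on dates whose ISO week is known (e.g. 2021-01-01 → ISO week 53 of 2020, as asserted earlier) and compare the summed watch_seconds per (iso_year, iso_week, category) against a hand-computed table.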