Merge four CSVs locally, robustly and efficiently
Company: Capital One
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
You receive four CSV files that must be merged locally on a laptop with 8 GB RAM, without relying on cloud services:
- products.csv: product_id, category, product_name
- purchases.csv: purchase_id, product_id, price, stars
- ads_events.csv: date (YYYY-MM-DD), user_id, ad_id, watch_seconds, clicks
- categories.csv: category, category_display_name
Constraints and quirks: files may use different delimiters (comma/semicolon), encodings (UTF-8/Windows-1252), and may contain duplicate header rows and stray BOMs; purchases.csv can exceed 10 million rows. Task: Produce two outputs—(A) category_min_price.csv containing, for every category in products.csv, the lowest price among purchases with stars > 4 (or 0 if none), and (B) weekly_watchtime.csv containing, for every (ISOYear, ISOWeek, category), the total watch_seconds from ads_events.csv after joining ads_events to products via AdID→product_id is not available; instead, a lookup file ad_product_map.csv is provided at runtime with columns ad_id, product_id. Describe and then implement (in pseudocode or Python) a robust merge pipeline that: detects and normalizes encodings and delimiters; streams large files in chunks; de-duplicates by natural keys; handles missing/NULL stars; computes ISO weeks correctly (state the formula or library call); avoids out-of-memory errors (e.g., via chunked pandas + on-disk store, or SQLite/DuckDB with CREATE TABLE AS SELECT); validates row counts and key coverage; and writes both outputs sorted and reproducible. Include exact join keys, data types you will enforce, and how you would unit test the correctness on a 1000-row synthetic sample.
Quick Answer: This question evaluates skills in scalable data ingestion, robust data cleaning and normalization, memory-efficient joins and aggregations, and handling heterogeneous file encodings and delimiters within SQL/Python workflows; Domain: Data Manipulation (SQL/Python).