Merge four CSVs locally, robustly and efficiently

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Capital One.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Capital One during technical interviews.

Question

You receive four CSV files that must be merged locally on a laptop with 8 GB RAM, without relying on cloud services:
- products.csv: product_id, category, product_name
- purchases.csv: purchase_id, product_id, price, stars
- ads_events.csv: date (YYYY-MM-DD), user_id, ad_id, watch_seconds, clicks
- categories.csv: category, category_display_name
Constraints and quirks: files may use different delimiters (comma/semicolon), encodings (UTF-8/Windows-1252), and may contain duplicate header rows and stray BOMs; purchases.csv can exceed 10 million rows. Task: Produce two outputs—(A) category_min_price.csv containing, for every category in products.csv, the lowest price among purchases with stars > 4 (or 0 if none), and (B) weekly_watchtime.csv containing, for every (ISOYear, ISOWeek, category), the total watch_seconds from ads_events.csv after joining ads_events to products via AdID→product_id is not available; instead, a lookup file ad_product_map.csv is provided at runtime with columns ad_id, product_id. Describe and then implement (in pseudocode or Python) a robust merge pipeline that: detects and normalizes encodings and delimiters; streams large files in chunks; de-duplicates by natural keys; handles missing/NULL stars; computes ISO weeks correctly (state the formula or library call); avoids out-of-memory errors (e.g., via chunked pandas + on-disk store, or SQLite/DuckDB with CREATE TABLE AS SELECT); validates row counts and key coverage; and writes both outputs sorted and reproducible. Include exact join keys, data types you will enforce, and how you would unit test the correctness on a 1000-row synthetic sample.

PracHub · Accepted Answer

This question evaluates skills in scalable data ingestion, robust data cleaning and normalization, memory-efficient joins and aggregations, and handling heterogeneous file encodings and delimiters within SQL/Python workflows; Domain: Data Manipulation (SQL/Python).

Quick Overview

Quick Overview