You receive four CSV files that must be merged locally on a laptop with 8 GB RAM, without relying on cloud services:
- products.csv: product_id, category, product_name
- purchases.csv: purchase_id, product_id, price, stars
- ads_events.csv: date (YYYY-MM-DD), user_id, ad_id, watch_seconds, clicks
- categories.csv: category, category_display_name
Constraints and quirks: the files may use different delimiters (comma or semicolon) and encodings (UTF-8 or Windows-1252), may contain duplicate header rows and stray BOMs, and purchases.csv can exceed 10 million rows.

Task: produce two outputs.
(A) category_min_price.csv: for every category in products.csv, the lowest price among purchases with stars > 4, or 0 if there are none.
(B) weekly_watchtime.csv: for every (ISOYear, ISOWeek, category), the total watch_seconds from ads_events.csv joined to products. ads_events.csv carries no product_id column, so the ad_id → product_id mapping comes from a lookup file, ad_product_map.csv (columns: ad_id, product_id), provided at runtime.

Describe and then implement (in pseudocode or Python) a robust merge pipeline that:
- detects and normalizes encodings and delimiters;
- streams large files in chunks;
- de-duplicates rows by their natural keys;
- handles missing/NULL stars;
- computes ISO weeks correctly (state the formula or library call);
- avoids out-of-memory errors (e.g., chunked pandas with an on-disk store, or SQLite/DuckDB with CREATE TABLE AS SELECT);
- validates row counts and key coverage;
- writes both outputs sorted and reproducible.

State the exact join keys and the data types you will enforce, and explain how you would unit test correctness on a 1000-row synthetic sample. Illustrative sketches of several of these steps follow.
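For the encoding and delimiter quirks, one workable approach is to sniff each file from a byte sample before parsing. A minimal standard-library sketch, assuming UTF-8 and Windows-1252 are the only encodings in play and that a 64 KB sample is representative (function names are illustrative):

```python
import csv

def sniff_encoding_and_delimiter(path, sample_bytes=64 * 1024):
    """Guess encoding (UTF-8 vs Windows-1252) and delimiter (',' vs ';') from a sample."""
    with open(path, "rb") as fh:
        raw = fh.read(sample_bytes)
    try:
        sample = raw.decode("utf-8-sig")   # utf-8-sig transparently strips a BOM
        encoding = "utf-8-sig"
    except UnicodeDecodeError:
        sample = raw.decode("cp1252")      # Windows-1252 fallback
        encoding = "cp1252"
    delimiter = csv.Sniffer().sniff(sample, delimiters=",;").delimiter
    return encoding, delimiter

def stream_rows(path):
    """Yield rows as dicts, skipping duplicate header rows embedded in the data."""
    encoding, delimiter = sniff_encoding_and_delimiter(path)
    with open(path, newline="", encoding=encoding) as fh:
        reader = csv.reader(fh, delimiter=delimiter)
        header = [h.strip().lstrip("\ufeff") for h in next(reader)]
        for row in reader:
            if [c.strip() for c in row] == header:   # repeated header row inside the data
                continue
            yield dict(zip(header, row))
```

A reasonable first pipeline stage is to push every raw file through this reader once and write a normalized UTF-8, comma-delimited copy (e.g. purchases_norm.csv); the SQL sketch further down assumes such copies exist.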
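For ISO weeks, the standard library already implements the ISO-8601 rules, so a library call is safer than a hand-rolled formula:

```python
from datetime import date

def iso_year_week(ymd: str) -> tuple[int, int]:
    """Map a 'YYYY-MM-DD' string to (ISOYear, ISOWeek) per ISO 8601."""
    iso = date.fromisoformat(ymd).isocalendar()   # (ISO year, ISO week, ISO weekday)
    return iso[0], iso[1]

# 2021-01-01 falls in ISO week 53 of ISO year 2020, not week 1 of 2021.
assert iso_year_week("2021-01-01") == (2020, 53)
```

In DuckDB SQL, the isoyear and week date parts follow the same ISO-8601 numbering, which is what the sketch below relies on.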
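To stay within 8 GB, one option named in the prompt is DuckDB with CREATE TABLE AS SELECT: the database lives on disk and the CSV readers stream, so the full purchases table never has to fit in memory. A hedged sketch, assuming the normalized *_norm.csv copies described above (file names are illustrative, and de-duplication of the large fact tables by their natural keys, purchase_id and (date, user_id, ad_id), is omitted for brevity):

```python
import duckdb

con = duckdb.connect("pipeline.duckdb")   # on-disk database, not an in-memory one

# Dimension tables: de-duplicate on the natural key and enforce types up front.
con.execute("""
    CREATE TABLE products AS
    SELECT DISTINCT CAST(product_id AS BIGINT) AS product_id,
                    CAST(category AS VARCHAR)  AS category
    FROM read_csv('products_norm.csv', header = true);
""")
con.execute("""
    CREATE TABLE ad_product_map AS
    SELECT DISTINCT CAST(ad_id AS BIGINT)      AS ad_id,
                    CAST(product_id AS BIGINT) AS product_id
    FROM read_csv('ad_product_map_norm.csv', header = true);
""")

# (A) Lowest qualifying price per category. Join key: products.product_id = purchases.product_id.
#     NULL stars never satisfy stars > 4; COALESCE supplies the required 0 for empty groups.
con.execute("""
    CREATE TABLE category_min_price AS
    SELECT p.category,
           COALESCE(MIN(pu.price) FILTER (WHERE pu.stars > 4), 0) AS min_price
    FROM products p
    LEFT JOIN read_csv('purchases_norm.csv', header = true,
                       columns = {'purchase_id': 'BIGINT', 'product_id': 'BIGINT',
                                  'price': 'DOUBLE', 'stars': 'DOUBLE'}) pu
           ON pu.product_id = p.product_id
    GROUP BY p.category;
""")

# (B) Weekly watch time. Join keys: ads_events.ad_id = ad_product_map.ad_id,
#     then ad_product_map.product_id = products.product_id.
con.execute("""
    CREATE TABLE weekly_watchtime AS
    SELECT date_part('isoyear', a.date) AS iso_year,
           date_part('week', a.date)    AS iso_week,
           p.category,
           SUM(a.watch_seconds)         AS total_watch_seconds
    FROM read_csv('ads_events_norm.csv', header = true,
                  columns = {'date': 'DATE', 'user_id': 'BIGINT', 'ad_id': 'BIGINT',
                             'watch_seconds': 'BIGINT', 'clicks': 'BIGINT'}) a
    JOIN ad_product_map m ON m.ad_id = a.ad_id
    JOIN products p       ON p.product_id = m.product_id
    GROUP BY ALL;
""")

# Deterministic, sorted outputs.
con.execute("COPY (SELECT * FROM category_min_price ORDER BY category) "
            "TO 'category_min_price.csv' (HEADER, DELIMITER ',');")
con.execute("COPY (SELECT * FROM weekly_watchtime ORDER BY iso_year, iso_week, category) "
            "TO 'weekly_watchtime.csv' (HEADER, DELIMITER ',');")
```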
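Row-count and key-coverage validation can run against the same on-disk database before the COPY statements; the checks below are illustrative rather than exhaustive:

```python
import duckdb

con = duckdb.connect("pipeline.duckdb")   # reuse the database created by the pipeline

# Output (A) must contain exactly one row per category present in products.csv.
n_categories = con.execute("SELECT count(DISTINCT category) FROM products").fetchone()[0]
n_rows_a = con.execute("SELECT count(*) FROM category_min_price").fetchone()[0]
assert n_rows_a == n_categories, "category_min_price must cover every category"

# Key coverage: ad_ids in ads_events without an entry in ad_product_map lose their
# watch_seconds in output (B); report them rather than failing silently.
unmapped = con.execute("""
    SELECT count(DISTINCT a.ad_id)
    FROM read_csv('ads_events_norm.csv', header = true) a
    LEFT JOIN ad_product_map m ON m.ad_id = a.ad_id
    WHERE m.ad_id IS NULL
""").fetchone()[0]
print(f"ad_ids without a product mapping: {unmapped}")
```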
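If DuckDB is not available, the minimum-price computation also fits the chunked-pandas route the prompt mentions: stream purchases.csv a million rows at a time and keep only one running minimum per product, which is small enough to hold in memory. A sketch under that assumption (chunk size and dtypes are choices, not requirements):

```python
import pandas as pd

def category_min_price_chunked(purchases_path, products_path, chunksize=1_000_000):
    """Output (A) via chunked pandas: running per-product minimum over stars > 4 rows."""
    products = pd.read_csv(products_path,
                           usecols=["product_id", "category"],
                           dtype={"product_id": "int64", "category": "string"})
    running_min: dict[int, float] = {}   # product_id -> lowest qualifying price so far
    for chunk in pd.read_csv(purchases_path, chunksize=chunksize,
                             usecols=["product_id", "price", "stars"]):
        chunk["stars"] = pd.to_numeric(chunk["stars"], errors="coerce")   # NULL/garbage -> NaN
        chunk["price"] = pd.to_numeric(chunk["price"], errors="coerce")
        qualifying = chunk[chunk["stars"] > 4]                            # NaN never passes >
        for pid, price in qualifying.groupby("product_id")["price"].min().items():
            running_min[pid] = min(price, running_min.get(pid, float("inf")))
    per_product = pd.Series(running_min, name="min_price")
    return (products.merge(per_product, how="left",
                           left_on="product_id", right_index=True)
                    .groupby("category")["min_price"].min()
                    .fillna(0)
                    .sort_index())
```

Output (B) can follow the same pattern: merge each ads_events chunk against the small ad_product_map and products frames and accumulate sums keyed by (iso_year, iso_week, category).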
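For the 1000-row synthetic test, the useful trick is to plant rows whose correct answer is known by construction (an extreme price with stars > 4, a cheaper one with low stars, a NULL-stars row) and assert the pipeline reproduces it. The fragment below checks the core output-(A) rules on a handful of hand-written rows; generating 1000 rows around the same planted extremes is a straightforward extension (the category names are made up for the test):

```python
import duckdb

def test_min_price_rules():
    con = duckdb.connect()   # throwaway in-memory database for the test
    con.execute("CREATE TABLE products (product_id BIGINT, category VARCHAR)")
    con.execute("INSERT INTO products VALUES (1, 'toys'), (2, 'books'), (3, 'games')")
    con.execute("CREATE TABLE purchases (purchase_id BIGINT, product_id BIGINT, "
                "price DOUBLE, stars DOUBLE)")
    con.execute("""
        INSERT INTO purchases VALUES
            (10, 1, 9.99, 5),      -- qualifies: should win for 'toys'
            (11, 1, 0.50, 2),      -- cheaper but stars <= 4: must be ignored
            (12, 2, 3.00, NULL),   -- NULL stars never satisfies stars > 4
            (13, 3, 7.00, 4.5)     -- qualifies for 'games'
    """)
    rows = con.execute("""
        SELECT p.category,
               COALESCE(MIN(pu.price) FILTER (WHERE pu.stars > 4), 0) AS min_price
        FROM products p
        LEFT JOIN purchases pu ON pu.product_id = p.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
    assert rows == [("books", 0.0), ("games", 7.0), ("toys", 9.99)]

test_min_price_rules()
```

An analogous test for output (B) would plant ads_events rows on dates whose ISO week is known (e.g. 2021-01-01 → ISO week 53 of 2020, as asserted earlier) and compare the summed watch_seconds per (iso_year, iso_week, category) against a hand-computed table.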