PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches
|Home/Data Manipulation (SQL/Python)/Capital One

Merge four CSVs locally, robustly and efficiently

Last updated: Mar 29, 2026

Quick Overview

This question evaluates skills in scalable data ingestion, robust data cleaning and normalization, memory-efficient joins and aggregations, and handling heterogeneous file encodings and delimiters within SQL/Python workflows; Domain: Data Manipulation (SQL/Python).

  • Medium
  • Capital One
  • Data Manipulation (SQL/Python)
  • Data Scientist

Merge four CSVs locally, robustly and efficiently

Company: Capital One

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Technical Screen

You receive four CSV files that must be merged locally on a laptop with 8 GB RAM, without relying on cloud services: - products.csv: product_id, category, product_name - purchases.csv: purchase_id, product_id, price, stars - ads_events.csv: date (YYYY-MM-DD), user_id, ad_id, watch_seconds, clicks - categories.csv: category, category_display_name Constraints and quirks: files may use different delimiters (comma/semicolon), encodings (UTF-8/Windows-1252), and may contain duplicate header rows and stray BOMs; purchases.csv can exceed 10 million rows. Task: Produce two outputs—(A) category_min_price.csv containing, for every category in products.csv, the lowest price among purchases with stars > 4 (or 0 if none), and (B) weekly_watchtime.csv containing, for every (ISOYear, ISOWeek, category), the total watch_seconds from ads_events.csv after joining ads_events to products via AdID→product_id is not available; instead, a lookup file ad_product_map.csv is provided at runtime with columns ad_id, product_id. Describe and then implement (in pseudocode or Python) a robust merge pipeline that: detects and normalizes encodings and delimiters; streams large files in chunks; de-duplicates by natural keys; handles missing/NULL stars; computes ISO weeks correctly (state the formula or library call); avoids out-of-memory errors (e.g., via chunked pandas + on-disk store, or SQLite/DuckDB with CREATE TABLE AS SELECT); validates row counts and key coverage; and writes both outputs sorted and reproducible. Include exact join keys, data types you will enforce, and how you would unit test the correctness on a 1000-row synthetic sample.

Quick Answer: This question evaluates skills in scalable data ingestion, robust data cleaning and normalization, memory-efficient joins and aggregations, and handling heterogeneous file encodings and delimiters within SQL/Python workflows; Domain: Data Manipulation (SQL/Python).

Related Interview Questions

  • Clean and Merge Housing Data - Capital One (easy)
  • Find Lowest Prices for Highly Rated Categories - Capital One (medium)
  • Write SQL to compute campaign net revenue - Capital One (Medium)
  • Merge CSVs and build revenue pivot with pandas - Capital One (Medium)
  • Find top category per region in Aug 2025 - Capital One (Medium)
Capital One logo
Capital One
Oct 13, 2025, 9:49 PM
Data Scientist
Technical Screen
Data Manipulation (SQL/Python)
6
0

You receive four CSV files that must be merged locally on a laptop with 8 GB RAM, without relying on cloud services:

  • products.csv: product_id, category, product_name
  • purchases.csv: purchase_id, product_id, price, stars
  • ads_events.csv: date (YYYY-MM-DD), user_id, ad_id, watch_seconds, clicks
  • categories.csv: category, category_display_name Constraints and quirks: files may use different delimiters (comma/semicolon), encodings (UTF-8/Windows-1252), and may contain duplicate header rows and stray BOMs; purchases.csv can exceed 10 million rows. Task: Produce two outputs—(A) category_min_price.csv containing, for every category in products.csv, the lowest price among purchases with stars > 4 (or 0 if none), and (B) weekly_watchtime.csv containing, for every (ISOYear, ISOWeek, category), the total watch_seconds from ads_events.csv after joining ads_events to products via AdID→product_id is not available; instead, a lookup file ad_product_map.csv is provided at runtime with columns ad_id, product_id. Describe and then implement (in pseudocode or Python) a robust merge pipeline that: detects and normalizes encodings and delimiters; streams large files in chunks; de-duplicates by natural keys; handles missing/NULL stars; computes ISO weeks correctly (state the formula or library call); avoids out-of-memory errors (e.g., via chunked pandas + on-disk store, or SQLite/DuckDB with CREATE TABLE AS SELECT); validates row counts and key coverage; and writes both outputs sorted and reproducible. Include exact join keys, data types you will enforce, and how you would unit test the correctness on a 1000-row synthetic sample.

Comments (0)

Sign in to leave a comment

Loading comments...

Browse More Questions

More Data Manipulation (SQL/Python)•More Capital One•More Data Scientist•Capital One Data Scientist•Capital One Data Manipulation (SQL/Python)•Data Scientist Data Manipulation (SQL/Python)
PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.