Load and visualize large CSV robustly

Q: What difficulty level is this coding question?

This is a Medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Voleon Group.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Voleon Group during technical interviews.

Question

You're screen-sharing in a HackerRank environment with Python 3, pandas, numpy, seaborn, and matplotlib available. You are given a single file, data.csv (~1.5 GB), whose delimiter is unknown (',' or ';') and whose encoding is either UTF-8 or latin-1.

Columns: id:int, date:YYYY-MM-DD, region:str, spend:float, clicks:int, signups:int. Up to 5% of values may be missing, there can be exact duplicate rows, and some rows have clicks=0.

Write code to:

(1) detect the delimiter and encoding without loading the full file, then load the file in chunks while keeping peak memory under 1 GB;
(2) drop exact duplicates and enforce dtypes;
(3) impute missing spend with the median within region; impute missing clicks/signups with 0 only if the row's non-null feature count is >= 4, otherwise drop the row, and justify this rule;
(4) create cpc = spend / clicks with safe division and winsorize cpc at the 1st/99th percentiles by region;
(5) produce and save: (a) a scatter plot of spend vs signups with a LOWESS smoothed line and 95% CI, (b) a boxplot of signups by region sorted by median, and (c) a time-series line of daily total signups;
(6) briefly explain your memory/time complexity choices and how you'd test this code.

Provide runnable, end-to-end code with any assumptions stated explicitly.

PracHub · Accepted Answer

This question evaluates competency in large-scale data manipulation and preprocessing, including robust file I/O (delimiter and encoding detection), memory-efficient chunked loading, dtype enforcement, deduplication, imputation strategy design, outlier handling, and statistical visualization techniques.
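One possible way to approach step (1) is sketched below. It is not the site's accepted answer. The helper names `sniff_format` and `iter_chunks`, the 1 MiB sample size, the chunk size, and the dtype choices are all assumptions made for illustration. The sketch also assumes the first line is a header and that a file prefix is representative: a latin-1 byte that first appears later in a file that otherwise decodes as UTF-8 would not be caught by the sample.

```python
import pandas as pd

# Column names come from the question; the dtype choices are assumptions.
# Nullable "Int64" lets integer columns hold missing values.
DTYPES = {
    "id": "Int64",
    "region": "string",
    "spend": "float64",
    "clicks": "Int64",
    "signups": "Int64",
}


def sniff_format(path, sample_bytes=1 << 20):
    """Guess (encoding, delimiter) from a prefix of the file."""
    with open(path, "rb") as fh:
        raw = fh.read(sample_bytes)
    # Cut at the last newline so a partial line (or a split multibyte
    # character) at the end of the sample cannot cause a false decode error.
    cut = raw.rfind(b"\n")
    if cut != -1:
        raw = raw[: cut + 1]
    try:
        text = raw.decode("utf-8")
        encoding = "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this cannot fail.
        text = raw.decode("latin-1")
        encoding = "latin-1"
    # Only ',' and ';' are candidates, so counting them in the header is enough.
    header = text.splitlines()[0]
    delimiter = ";" if header.count(";") > header.count(",") else ","
    return encoding, delimiter


def iter_chunks(path, chunksize=200_000):
    """Yield typed chunks with duplicates removed within each chunk."""
    encoding, delimiter = sniff_format(path)
    reader = pd.read_csv(
        path,
        sep=delimiter,
        encoding=encoding,
        dtype=DTYPES,
        parse_dates=["date"],
        chunksize=chunksize,
    )
    for chunk in reader:
        # Duplicates that span two chunks survive this step and need a
        # later pass (for example on the concatenated, reduced frame).
        yield chunk.drop_duplicates()
```

Only one chunk is held in memory at a time, so peak memory is governed by `chunksize` rather than by the file size; the value would need tuning against the 1 GB budget.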

Quick Overview
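As a rough outline of cleaning steps (2) to (4) from the question, the following sketch operates on a DataFrame that already fits in memory. The function name `clean` is hypothetical, and two readings of the question are assumptions: that "features" means the five non-id columns, and that the non-null count is taken before any imputation. It also assumes region is never missing.

```python
import pandas as pd

# Assumption: id is a key, so it does not count as a feature.
FEATURES = ["date", "region", "spend", "clicks", "signups"]


def clean(df):
    """Deduplicate, impute, and add a winsorized cpc column."""
    df = df.drop_duplicates().copy()

    # Rule from the question: rows missing clicks or signups are kept
    # (and zero-filled) only if at least 4 features are present.
    non_null = df[FEATURES].notna().sum(axis=1)
    missing_counts = df[["clicks", "signups"]].isna().any(axis=1)
    df = df[~(missing_counts & (non_null < 4))].copy()

    # Missing spend takes the median spend of the row's region.
    regional_median = df.groupby("region")["spend"].transform("median")
    df["spend"] = df["spend"].fillna(regional_median)

    for col in ["clicks", "signups"]:
        df[col] = df[col].fillna(0).astype("int64")

    # Safe division: zero clicks become NaN, so cpc is NaN rather than inf.
    clicks = df["clicks"].where(df["clicks"] > 0)
    df["cpc"] = df["spend"] / clicks

    # Winsorize: clip cpc to each region's 1st and 99th percentiles.
    grouped = df.groupby("region")["cpc"]
    lower = grouped.transform(lambda s: s.quantile(0.01))
    upper = grouped.transform(lambda s: s.quantile(0.99))
    df["cpc"] = df["cpc"].clip(lower=lower, upper=upper)
    return df
```

Leaving cpc as NaN for zero-click rows keeps them out of the percentile computation, since `quantile` skips missing values. Whether those rows should instead be assigned a cpc of 0 is a modeling choice the question leaves open.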
