PracHub

Quick Overview

This question evaluates competency in large-scale data manipulation and preprocessing, including robust file I/O (delimiter and encoding detection), memory-efficient chunked loading, dtype enforcement, deduplication, imputation strategy design, outlier handling, and statistical visualization techniques.


Load and visualize large CSV robustly

Company: Voleon Group

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Technical Screen

You're screen-sharing in a HackerRank environment with Python 3, pandas, numpy, seaborn, and matplotlib available. You are given a single file data.csv (~1.5 GB) whose delimiter is unknown (',' or ';') and whose encoding is either UTF-8 or latin-1.

Columns: id:int, date:YYYY-MM-DD, region:str, spend:float, clicks:int, signups:int. Up to 5% of values may be missing, there can be exact duplicate rows, and some rows have clicks=0.

Write code to:

  (1) Detect the delimiter and encoding without loading the full file, then load in chunks while keeping peak memory under 1 GB.
  (2) Drop exact duplicates and enforce dtypes.
  (3) Impute missing spend with the median within region. Impute missing clicks/signups with 0 only if the row's non-null feature count is >= 4; otherwise drop the row. Justify this rule.
  (4) Create cpc = spend / clicks with safe division, and winsorize cpc at the 1st/99th percentiles by region.
  (5) Produce and save: (a) a scatter plot of spend vs signups with a LOWESS smoothed line and 95% CI, (b) a boxplot of signups by region sorted by median, and (c) a time-series line of daily total signups.
  (6) Briefly explain your memory/time complexity choices and how you'd test this code.

Provide runnable, end-to-end code with any assumptions stated explicitly.
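As a starting point for step (1), the sketch below shows one possible way to guess the encoding and delimiter from a bounded byte sample and then hand the result to pandas for chunked reading. The names `sniff_format` and `iter_chunks`, the 1 MB sample size, and the 500,000-row chunk size are illustrative assumptions, not part of the question. It also assumes the header row contains no delimiter characters inside column names, and that a sample that decodes as UTF-8 is representative of the whole file, which is not guaranteed.

```python
import csv  # not used for sniffing here; a plain count on the header is enough


def sniff_format(path, sample_bytes=1_000_000):
    """Guess (encoding, delimiter) by reading at most `sample_bytes` bytes."""
    with open(path, "rb") as fh:
        sample = fh.read(sample_bytes)

    # If the sample was cut off by the size limit, drop the partial last line
    # so a multi-byte UTF-8 character split at the boundary does not look
    # like an encoding error.
    if len(sample) == sample_bytes:
        cut = sample.rfind(b"\n")
        if cut != -1:
            sample = sample[: cut + 1]

    try:
        text = sample.decode("utf-8")
        encoding = "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this decode cannot fail.
        text = sample.decode("latin-1")
        encoding = "latin-1"

    # Only ',' and ';' are candidates, so compare their counts in the header.
    lines = text.splitlines()
    header = lines[0] if lines else ""
    delimiter = ";" if header.count(";") > header.count(",") else ","
    return encoding, delimiter


def iter_chunks(path, chunksize=500_000):
    """Return an iterator of DataFrame chunks using the sniffed format."""
    import pandas as pd  # imported lazily; the sniffer itself needs only stdlib

    encoding, delimiter = sniff_format(path)
    return pd.read_csv(path, sep=delimiter, encoding=encoding, chunksize=chunksize)
```

Each chunk yielded by `iter_chunks("data.csv")` can then be deduplicated and downcast before being accumulated, so that only one raw chunk is held in memory at a time. Whether the accumulated result stays under 1 GB depends on the dtypes chosen in step (2), so that budget still needs to be measured rather than assumed.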


Last updated: Mar 29, 2026

Related Coding Questions

  • Analyze time-zoned events with pandas - Voleon Group (Medium)
  • Pre-process Financial Data for Linear Regression Modeling - Voleon Group (Medium)

