PracHub

Quick Overview

This question evaluates proficiency in data imputation, strict train/validation leakage prevention, temporal per-user propagation rules, and categorical aggregation for model-ready features.

Impute missing values without leakage

Company: Capital One

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: HR Screen

Given a DataFrame df with columns: user_id, event_date (datetime), country (categorical), device_type (categorical), age (numeric), income (numeric), last_purchase_days_ago (numeric), session_length (numeric), is_active_30d (binary label). Implement code to impute missing values for model training with strict no‑leakage.

Requirements:

1) Split into train/validation indices; all statistics/models for imputation must be fit on train only and then applied to validation.

2) Numeric: age → median within country (train‑only medians); income → train a ridge regression imputer on train rows using predictors [age_imputed, country, device_type, last_purchase_days_ago, session_length] (one‑hot encoded), then predict income for both train/validation; do not use the label.

3) Time‑ordered within user: for last_purchase_days_ago and session_length, sort by event_date per user_id and forward‑fill gaps up to 14 days; if the gap between consecutive event_date exceeds 14 days, do not propagate; after sequence fills, fill remaining NaNs with the global train median for that feature.

4) Categoricals: device_type and country → per‑user mode; break ties with the global train mode.

5) Deliver: functions fit_imputers(df, train_idx) and transform_impute(df, imputers, idx) where imputers holds all train‑fit objects/statistics; include assertions that no value derived from validation data was used to compute train statistics.
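The non-regression parts of the requirements (train-only statistics, age by country, gap-limited forward-fill, per-user mode) could be sketched in pandas roughly as follows. This is an illustration under stated assumptions, not a reference solution. The helpers `_user_mode` and `assert_train_only`, the `MAX_GAP` constant, and the dictionary keys are hypothetical names. It also assumes a unique index, plain string categorical columns, per-user modes computed from train rows only, a tie resolved by taking the global train mode directly, an overall train median as the age fallback for unseen countries, and a transform that only looks at the rows in `idx`, so train rows never see validation history.

```python
import numpy as np
import pandas as pd

MAX_GAP = pd.Timedelta(days=14)
SEQ_COLS = ["last_purchase_days_ago", "session_length"]
CAT_COLS = ["country", "device_type"]


def _user_mode(train, col, global_mode):
    """Most frequent value per user in train; ties fall back to global_mode."""
    counts = train.groupby(["user_id", col]).size().rename("n").reset_index()
    top = counts[counts["n"] == counts.groupby("user_id")["n"].transform("max")]
    tied = top.groupby("user_id")[col].nunique() > 1
    mode = top.drop_duplicates("user_id").set_index("user_id")[col]
    return mode.mask(tied, global_mode)


def fit_imputers(df, train_idx):
    """Compute every imputation statistic from train rows only."""
    train = df.loc[train_idx]
    global_mode = {c: train[c].mode().iloc[0] for c in CAT_COLS}
    return {
        "global_mode": global_mode,
        "user_mode": {c: _user_mode(train, c, global_mode[c]) for c in CAT_COLS},
        "age_by_country": train.groupby("country")["age"].median(),
        "age_global": train["age"].median(),  # assumed fallback for unseen countries
        "seq_median": train[SEQ_COLS].median(),
    }


def transform_impute(df, imputers, idx):
    """Impute the rows in idx from their own history plus train-fit statistics."""
    out = df.loc[idx].sort_values(["user_id", "event_date"])

    # Categoricals first, so the country lookup for age sees filled values.
    for col in CAT_COLS:
        per_user = out["user_id"].map(imputers["user_mode"][col])
        out[col] = out[col].fillna(per_user).fillna(imputers["global_mode"][col])

    # A gap of more than 14 days between consecutive events opens a new
    # segment, so the forward-fill never crosses it.
    gap = out.groupby("user_id")["event_date"].diff()
    segment = (gap > MAX_GAP).groupby(out["user_id"]).cumsum().rename("segment")
    out[SEQ_COLS] = out.groupby(["user_id", segment])[SEQ_COLS].ffill()
    out[SEQ_COLS] = out[SEQ_COLS].fillna(imputers["seq_median"])

    country_median = out["country"].map(imputers["age_by_country"])
    out["age"] = out["age"].fillna(country_median).fillna(imputers["age_global"])
    return out.loc[idx]


def assert_train_only(df, train_idx):
    """Wiping all non-train rows must leave every fitted statistic unchanged."""
    wiped = df.copy()
    other = wiped.index.difference(pd.Index(train_idx))
    wiped.loc[other, ["age"] + SEQ_COLS + CAT_COLS] = np.nan
    a, b = fit_imputers(df, train_idx), fit_imputers(wiped, train_idx)
    pd.testing.assert_series_equal(a["age_by_country"], b["age_by_country"])
    pd.testing.assert_series_equal(a["seq_median"], b["seq_median"])
    assert a["age_global"] == b["age_global"]
    for col in CAT_COLS:
        pd.testing.assert_series_equal(a["user_mode"][col], b["user_mode"][col])
        assert a["global_mode"][col] == b["global_mode"][col]
```

Segmenting on the gap between consecutive rows is one reading of the 14-day rule; if the intended rule is "at most 14 days since the last observed value", the segment logic would need to compare each row against the date of the last non-missing value instead.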

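For the income step in requirement 2, a closed-form ridge regression in NumPy is one possible sketch; it stands in for a library estimator such as scikit-learn's `Ridge` and is not presented as the expected answer. The function names `fit_income_ridge` and `impute_income`, the `alpha` default, and the returned dictionary keys are hypothetical. The sketch assumes the predictor columns have already been imputed (so `age` plays the role of age_imputed), standardizes features with train statistics, leaves the intercept unpenalized, and never reads is_active_30d.

```python
import numpy as np
import pandas as pd

NUM = ["age", "last_purchase_days_ago", "session_length"]
CAT = ["country", "device_type"]


def _design(df, columns=None):
    """One-hot encode the predictors; align to the train columns if given."""
    X = pd.get_dummies(df[NUM + CAT], columns=CAT, dtype=float)
    if columns is not None:
        # Unseen categories are dropped; missing train categories become 0.
        X = X.reindex(columns=columns, fill_value=0.0)
    return X


def fit_income_ridge(df, train_idx, alpha=1.0):
    """Fit ridge on train rows with observed income only."""
    train = df.loc[train_idx]
    known = train[train["income"].notna()]
    X = _design(known)
    mu = X.mean()
    sd = X.std(ddof=0).replace(0, 1.0)  # avoid dividing by zero for constant columns
    Z = ((X - mu) / sd).to_numpy()
    y = known["income"].to_numpy(dtype=float)
    intercept = y.mean()
    # Closed-form ridge solution: (Z'Z + alpha * I) w = Z'(y - mean(y)).
    A = Z.T @ Z + alpha * np.eye(Z.shape[1])
    coef = np.linalg.solve(A, Z.T @ (y - intercept))
    return {"columns": X.columns, "mu": mu, "sd": sd,
            "coef": coef, "intercept": intercept}


def impute_income(df, model, idx):
    """Fill missing income for the rows in idx with the train-fit model."""
    out = df.loc[idx].copy()
    X = _design(out, model["columns"])
    Z = ((X - model["mu"]) / model["sd"]).to_numpy()
    pred = pd.Series(Z @ model["coef"] + model["intercept"], index=out.index)
    out["income"] = out["income"].fillna(pred)  # observed incomes are kept
    return out
```

Because the model is fit once on `train_idx` and then applied with the stored column list and scaling statistics, the same `impute_income` call can be used for both train and validation rows without refitting.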

Last updated: Mar 29, 2026


Related Coding Questions

  • Clean and Merge Housing Data - Capital One (Easy)
  • Find Lowest Prices for Highly Rated Categories - Capital One (Medium)
  • Write SQL to compute campaign net revenue - Capital One (Medium)
  • Merge CSVs and build revenue pivot with pandas - Capital One (Medium)
  • Find top category per region in Aug 2025 - Capital One (Medium)