PracHub
QuestionsPremiumLearningGuidesCheatsheetNEWCoaches

Quick Overview

This question evaluates proficiency in large-scale data manipulation and feature engineering using pandas, covering timestamp normalization across time zones, deduplication and last-write-wins semantics, data validation, memory-efficient aggregation, rolling-window statistics, and top-k categorical extraction for merchant categories.

  • Medium
  • Boston Consulting Group
  • Data Manipulation (SQL/Python)
  • Data Scientist

Transform messy transactions with pandas

Company: Boston Consulting Group

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Technical Screen

You are given two CSVs. transactions.csv - Columns: txn_id, user_id, ts_iso (ISO8601 with timezone), amount (decimal USD; refunds negative), merchant_cat, type {purchase, refund, chargeback}, updated_at (last write wins), dup_hint (string that can be identical across near-duplicate rows) - Sample rows: 1, U1, 2024-03-31T23:55:00-0700, 120.00, groceries, purchase, 2024-04-01T00:02:00Z, A 2, U1, 2024-03-31T23:58:10-0700, -20.00, groceries, refund, 2024-04-01T00:05:00Z, B 3, U2, 2024-04-01T07:01:00+0200, 15.00, digital, purchase, 2024-04-01T05:05:00Z, C 3, U2, 2024-04-01T07:01:00+0200, 15.00, digital, purchase, 2024-04-01T05:06:10Z, C (duplicate with later updated_at) 4, U3, 2024-04-02T10:00:00Z, 500.00, travel, purchase, 2024-04-02T10:01:00Z, D 5, U3, 2024-06-15T23:59:59-0400, 500.00, travel, chargeback, 2024-06-20T12:00:00Z, E 6, U1, 2024-06-01T00:00:10Z, 0.00, fees, purchase, 2024-06-01T00:00:20Z, F 7, U2, 2024-06-30T23:59:59-0700, 200.00, electronics, purchase, 2024-07-01T08:00:00Z, G users.csv - Columns: user_id, signup_ts_iso, country, tz_name - Sample rows: U1, 2024-01-15T12:00:00Z, US, America/Los_Angeles U2, 2024-02-20T09:30:00Z, US, America/New_York U3, 2023-12-01T00:00:00Z, GB, Europe/London Task (write idiomatic, production-ready pandas code without groupby.apply or explicit Python loops over rows; assume data can be 100M+ rows): 1) Read both files, parse timestamps, and normalize ts_iso to UTC. Deduplicate transactions by txn_id keeping only the row with the max updated_at. Sanity-check and drop rows where amount is NaN, type is invalid, or ts_iso is outside [2023-01-01, 2025-12-31]. 2) Exclude refunds and chargebacks from spend features but keep them in a separate flag. Define net_spend as sum of amounts over type=='purchase' only. 3) Build month-level features per user for the window 2024-03-01 through 2024-07-31 (inclusive, calendar months in the user’s tz_name, but aggregated after converting to UTC to avoid DST duplication): - active_month (1 if user has ≥1 purchase in that local month, else 0), - monthly_net_spend (sum of positive purchase amounts in that local month), - rolling_3m_median_spend computed over active months only (skip months with active_month=0; do not fill implicit zeros), aligned to month end. 4) For each user-month, compute the top-3 merchant_cat by monthly_net_spend and emit them as categorical features cat1, cat2, cat3. Break ties by larger monthly_net_spend then lexicographic merchant_cat. If <3 categories exist, fill with 'None'. 5) Output one row per user for the snapshot date 2024-07-31 containing: user_id, months_active_last_5m, total_net_spend_last_5m, had_any_refund_last_5m (boolean from refunds/chargebacks), rolling_3m_median_spend_at_2024_07, cat1_2024_07, cat2_2024_07, cat3_2024_07. Ensure results are idempotent if you re-run on the same inputs, and memory-efficient (hint: use categorical dtypes, proper indexing, and avoid exploding intermediate DataFrames). Explain any edge cases you handle (DST boundaries, zero-amount rows, duplicate near-same rows via dup_hint).

Quick Answer: This question evaluates proficiency in large-scale data manipulation and feature engineering using pandas, covering timestamp normalization across time zones, deduplication and last-write-wins semantics, data validation, memory-efficient aggregation, rolling-window statistics, and top-k categorical extraction for merchant categories.

Last updated: Mar 29, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Unify 7 tables and impute missing values - Boston Consulting Group (Medium)
  • Query top spenders and 7-day growth - Boston Consulting Group (Medium)
  • Manipulate and merge DataFrames correctly - Boston Consulting Group (Medium)
  • Transform and aggregate messy event data - Boston Consulting Group (Medium)
  • Merge and Concatenate Inconsistent Order Files with Pandas - Boston Consulting Group (Medium)