Transform messy transactions with pandas

Q: Transform messy transactions with pandas

This question evaluates proficiency in large-scale data manipulation and feature engineering using pandas, covering timestamp normalization across time zones, deduplication and last-write-wins semantics, data validation, memory-efficient aggregation, rolling-window statistics, and top-k categorical extraction for merchant categories.

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a Medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Boston Consulting Group.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Boston Consulting Group during technical interviews.

Question

You are given two CSVs.

transactions.csv
- Columns: txn_id, user_id, ts_iso (ISO8601 with timezone), amount (decimal USD; refunds negative), merchant_cat, type {purchase, refund, chargeback}, updated_at (last write wins), dup_hint (string that can be identical across near-duplicate rows)
- Sample rows:
 1, U1, 2024-03-31T23:55:00-0700, 120.00, groceries, purchase, 2024-04-01T00:02:00Z, A
 2, U1, 2024-03-31T23:58:10-0700, -20.00, groceries, refund, 2024-04-01T00:05:00Z, B
 3, U2, 2024-04-01T07:01:00+0200, 15.00, digital, purchase, 2024-04-01T05:05:00Z, C
 3, U2, 2024-04-01T07:01:00+0200, 15.00, digital, purchase, 2024-04-01T05:06:10Z, C (duplicate with later updated_at)
 4, U3, 2024-04-02T10:00:00Z, 500.00, travel, purchase, 2024-04-02T10:01:00Z, D
 5, U3, 2024-06-15T23:59:59-0400, 500.00, travel, chargeback, 2024-06-20T12:00:00Z, E
 6, U1, 2024-06-01T00:00:10Z, 0.00, fees, purchase, 2024-06-01T00:00:20Z, F
 7, U2, 2024-06-30T23:59:59-0700, 200.00, electronics, purchase, 2024-07-01T08:00:00Z, G

users.csv
- Columns: user_id, signup_ts_iso, country, tz_name
- Sample rows:
 U1, 2024-01-15T12:00:00Z, US, America/Los_Angeles
 U2, 2024-02-20T09:30:00Z, US, America/New_York
 U3, 2023-12-01T00:00:00Z, GB, Europe/London

Task (write idiomatic, production-ready pandas code without groupby.apply or explicit Python loops over rows; assume data can be 100M+ rows):
1) Read both files, parse timestamps, and normalize ts_iso to UTC. Deduplicate transactions by txn_id keeping only the row with the max updated_at. Sanity-check and drop rows where amount is NaN, type is invalid, or ts_iso is outside [2023-01-01, 2025-12-31].
2) Exclude refunds and chargebacks from spend features but keep them in a separate flag. Define net_spend as sum of amounts over type=='purchase' only.
3) Build month-level features per user for the window 2024-03-01 through 2024-07-31 (inclusive, calendar months in the user’s tz_name, but aggregated after converting to UTC to avoid DST duplication):
 - active_month (1 if user has ≥1 purchase in that local month, else 0),
 - monthly_net_spend (sum of positive purchase amounts in that local month),
 - rolling_3m_median_spend computed over active months only (skip months with active_month=0; do not fill implicit zeros), aligned to month end.
4) For each user-month, compute the top-3 merchant_cat by monthly_net_spend and emit them as categorical features cat1, cat2, cat3. Break ties by larger monthly_net_spend then lexicographic merchant_cat. If <3 categories exist, fill with 'None'.
5) Output one row per user for the snapshot date 2024-07-31 containing: user_id, months_active_last_5m, total_net_spend_last_5m, had_any_refund_last_5m (boolean from refunds/chargebacks), rolling_3m_median_spend_at_2024_07, cat1_2024_07, cat2_2024_07, cat3_2024_07. Ensure results are idempotent if you re-run on the same inputs, and memory-efficient (hint: use categorical dtypes, proper indexing, and avoid exploding intermediate DataFrames). Explain any edge cases you handle (DST boundaries, zero-amount rows, duplicate near-same rows via dup_hint).

PracHub · Accepted Answer

This question evaluates proficiency in large-scale data manipulation and feature engineering using pandas, covering timestamp normalization across time zones, deduplication and last-write-wins semantics, data validation, memory-efficient aggregation, rolling-window statistics, and top-k categorical extraction for merchant categories.

Quick Overview

Quick Overview