You are given two CSVs.
transactions.csv
- Columns: txn_id, user_id, ts_iso (ISO 8601 with timezone offset), amount (decimal USD; refunds are negative), merchant_cat, type {purchase, refund, chargeback}, updated_at (last write wins), dup_hint (a string that can be identical across near-duplicate rows)
- Sample rows:
1, U1, 2024-03-31T23:55:00-0700, 120.00, groceries, purchase, 2024-04-01T00:02:00Z, A
2, U1, 2024-03-31T23:58:10-0700, -20.00, groceries, refund, 2024-04-01T00:05:00Z, B
3, U2, 2024-04-01T07:01:00+0200, 15.00, digital, purchase, 2024-04-01T05:05:00Z, C
3, U2, 2024-04-01T07:01:00+0200, 15.00, digital, purchase, 2024-04-01T05:06:10Z, C   (duplicate of txn_id 3, with a later updated_at)
4, U3, 2024-04-02T10:00:00Z, 500.00, travel, purchase, 2024-04-02T10:01:00Z, D
5, U3, 2024-06-15T23:59:59-0400, 500.00, travel, chargeback, 2024-06-20T12:00:00Z, E
6, U1, 2024-06-01T00:00:10Z, 0.00, fees, purchase, 2024-06-01T00:00:20Z, F
7, U2, 2024-06-30T23:59:59-0700, 200.00, electronics, purchase, 2024-07-01T08:00:00Z, G
users.csv
- Columns: user_id, signup_ts_iso, country, tz_name
- Sample rows:
U1, 2024-01-15T12:00:00Z, US, America/Los_Angeles
U2, 2024-02-20T09:30:00Z, US, America/New_York
U3, 2023-12-01T00:00:00Z, GB, Europe/London
Task (write idiomatic, production-ready pandas code without groupby.apply or explicit Python loops over rows; assume data can be 100M+ rows):
- Read both files, parse timestamps, and normalize ts_iso to UTC. Deduplicate transactions by txn_id, keeping only the row with the maximum updated_at. Sanity-check the data: drop rows where amount is NaN, type is not one of the three valid values, or ts_iso falls outside [2023-01-01, 2025-12-31].
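A minimal sketch of the ingest step under the spec above. The column names come from the schema; the inline rows are hypothetical stand-ins for the real file:

```python
import io
import pandas as pd

# Inline stand-in for transactions.csv (hypothetical rows for illustration).
raw = io.StringIO(
    "txn_id,user_id,ts_iso,amount,merchant_cat,type,updated_at,dup_hint\n"
    "3,U2,2024-04-01T07:01:00+0200,15.00,digital,purchase,2024-04-01T05:05:00Z,C\n"
    "3,U2,2024-04-01T07:01:00+0200,15.00,digital,purchase,2024-04-01T05:06:10Z,C\n"
    "9,U9,2020-01-01T00:00:00Z,10.00,misc,purchase,2020-01-01T00:00:00Z,X\n"
)

tx = pd.read_csv(
    raw,
    dtype={"user_id": "category", "merchant_cat": "category", "type": "category"},
)
# ts_iso mixes UTC offsets; utc=True yields one tz-aware UTC column.
tx["ts_utc"] = pd.to_datetime(tx["ts_iso"], utc=True)
tx["updated_at"] = pd.to_datetime(tx["updated_at"], utc=True)

# Last-write-wins dedup on txn_id: stable sort + keep="last" is idempotent.
tx = tx.sort_values(["txn_id", "updated_at"]).drop_duplicates("txn_id", keep="last")

# Sanity filters: NaN amount, invalid type, timestamp outside the window.
lo = pd.Timestamp("2023-01-01", tz="UTC")
hi = pd.Timestamp("2026-01-01", tz="UTC")  # exclusive bound covers all of 2025-12-31
tx = tx[
    tx["amount"].notna()
    & tx["type"].isin(["purchase", "refund", "chargeback"])
    & tx["ts_utc"].ge(lo)
    & tx["ts_utc"].lt(hi)
].copy()
```

Reading the real files would swap `raw` for a file path (ideally with `usecols` and the same `dtype` map to keep memory flat at 100M+ rows).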
- Exclude refunds and chargebacks from spend features, but track their presence via a separate flag. Define net_spend as the sum of amounts over type == 'purchase' only.
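One way to keep refund activity as a flag while computing net_spend from purchases only; the tiny frame below is a hypothetical stand-in for the cleaned transactions table:

```python
import pandas as pd

# Tiny hypothetical frame mirroring the cleaned transactions table.
tx = pd.DataFrame({
    "user_id": ["U1", "U1", "U3"],
    "amount": [120.0, -20.0, 500.0],
    "type": pd.Categorical(["purchase", "refund", "chargeback"]),
})

# Flags: purchases feed spend features; refunds/chargebacks only set a flag.
tx["is_purchase"] = tx["type"].eq("purchase")
tx["is_refund_like"] = tx["type"].isin(["refund", "chargeback"])

# net_spend per user: purchase amounts only; other rows contribute 0.
net_spend = tx["amount"].where(tx["is_purchase"], 0.0).groupby(tx["user_id"]).sum()
had_refund = tx.groupby("user_id")["is_refund_like"].any()
```

Using `where(..., 0.0)` instead of a boolean slice keeps every user present in the groupby result even when they have no purchases.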
- Build month-level features per user for the window 2024-03-01 through 2024-07-31 (inclusive; months are calendar months in the user’s tz_name, but aggregate after converting to UTC to avoid DST duplication):
  - active_month (1 if the user has ≥1 purchase in that local month, else 0),
  - monthly_net_spend (sum of positive purchase amounts in that local month),
  - rolling_3m_median_spend, computed over active months only (skip months with active_month = 0; do not fill implicit zeros), aligned to month end.
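A sketch of the local-month bucketing and the active-months-only rolling median. Iterating over the handful of distinct timezones (not over rows) keeps each conversion vectorized; the frame is a hypothetical stand-in for cleaned purchases already joined to users:

```python
import pandas as pd

# Hypothetical cleaned purchases joined to users (ts_utc is tz-aware UTC).
df = pd.DataFrame({
    "user_id": ["U1", "U1", "U2"],
    "ts_utc": pd.to_datetime([
        "2024-04-01T06:55:00Z",  # 23:55 on 2024-03-31 in Los Angeles
        "2024-05-02T12:00:00Z",
        "2024-04-01T05:01:00Z",
    ], utc=True),
    "amount": [120.0, 50.0, 15.0],
    "tz_name": ["America/Los_Angeles", "America/Los_Angeles", "America/New_York"],
})
df = df[df["amount"] > 0]  # monthly_net_spend uses positive purchase amounts

# Bucket rows into the user's *local* calendar month, one timezone at a time.
parts = []
for tz, grp in df.groupby("tz_name", observed=True):
    local = grp["ts_utc"].dt.tz_convert(tz).dt.tz_localize(None)
    parts.append(local.dt.to_period("M"))
df["local_month"] = pd.concat(parts)  # aligns back by index

window = (pd.Period("2024-03", "M"), pd.Period("2024-07", "M"))
df = df[df["local_month"].between(*window)]

monthly = (
    df.groupby(["user_id", "local_month"], observed=True)["amount"]
      .sum().rename("monthly_net_spend").reset_index()
      .sort_values(["user_id", "local_month"])
)
monthly["active_month"] = 1  # rows exist only for months with >=1 purchase

# Inactive months are simply absent, so a plain positional window skips them.
monthly["rolling_3m_median_spend"] = (
    monthly.groupby("user_id")["monthly_net_spend"]
           .rolling(3, min_periods=1).median()
           .reset_index(level=0, drop=True)
)
```

Converting to local wall time before `to_period("M")` assigns each purchase to the user's calendar month, while the aggregation itself still runs once over the concatenated frame, so a DST transition cannot double-count a row.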
- For each user-month, compute the top-3 merchant_cat by monthly_net_spend and emit them as categorical features cat1, cat2, cat3. Rank by larger monthly_net_spend first, breaking ties lexicographically on merchant_cat. If fewer than three categories exist, fill the remaining slots with 'None'.
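The top-3 ranking can be done with a sort, a `cumcount` rank, and a pivot, with no groupby.apply. The per-(user, month, category) table below is a hypothetical stand-in:

```python
import pandas as pd

# Hypothetical per-(user, month, category) spend, already aggregated.
cat_spend = pd.DataFrame({
    "user_id": ["U1"] * 4,
    "month": [pd.Period("2024-07", "M")] * 4,
    "merchant_cat": ["groceries", "travel", "digital", "fees"],
    "monthly_net_spend": [120.0, 500.0, 120.0, 5.0],
})

# Rank within user-month: larger spend first, ties broken lexicographically.
ranked = cat_spend.sort_values(
    ["user_id", "month", "monthly_net_spend", "merchant_cat"],
    ascending=[True, True, False, True],
)
ranked["rank"] = ranked.groupby(["user_id", "month"], observed=True).cumcount() + 1

# Pivot ranks 1-3 into cat1..cat3; missing slots become the string 'None'.
wide = (
    ranked[ranked["rank"] <= 3]
    .pivot(index=["user_id", "month"], columns="rank", values="merchant_cat")
    .reindex(columns=[1, 2, 3])
    .fillna("None")
    .rename(columns={1: "cat1", 2: "cat2", 3: "cat3"})
)
```

In the sample, "digital" and "groceries" tie at 120.00, so the lexicographic tie-break places "digital" ahead; the `reindex(columns=[1, 2, 3])` guarantees all three slots exist even when every user-month has fewer categories.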
- Output one row per user for the snapshot date 2024-07-31 containing: user_id, months_active_last_5m, total_net_spend_last_5m, had_any_refund_last_5m (boolean derived from refunds/chargebacks), rolling_3m_median_spend_at_2024_07, cat1_2024_07, cat2_2024_07, cat3_2024_07. Results must be idempotent if you re-run on the same inputs, and memory-efficient (hint: use categorical dtypes, proper indexing, and avoid exploding intermediate DataFrames). Explain the edge cases you handle (DST boundaries, zero-amount rows, near-duplicate rows sharing a dup_hint).
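Assembling the snapshot row from the user-month feature table might look like the following; the feature table itself is a hypothetical stand-in (column names per the spec):

```python
import pandas as pd

# Hypothetical user-month feature table covering 2024-03..2024-07.
feats = pd.DataFrame({
    "user_id": ["U1", "U1", "U2"],
    "month": pd.PeriodIndex(["2024-03", "2024-07", "2024-06"], freq="M"),
    "monthly_net_spend": [100.0, 40.0, 200.0],
    "had_refund": [True, False, False],
    "rolling_3m_median_spend": [100.0, 70.0, 200.0],
})

snap = pd.Period("2024-07", "M")
window = feats[feats["month"].between(pd.Period("2024-03", "M"), snap)]

# Pure reductions per user over the 5-month window.
out = window.groupby("user_id").agg(
    months_active_last_5m=("month", "nunique"),
    total_net_spend_last_5m=("monthly_net_spend", "sum"),
    had_any_refund_last_5m=("had_refund", "any"),
)

# Rolling median *at* the snapshot month; NaN if inactive in 2024-07.
at_snap = (
    window.loc[window["month"].eq(snap)]
          .set_index("user_id")["rolling_3m_median_spend"]
          .rename("rolling_3m_median_spend_at_2024_07")
)
out = out.join(at_snap)
```

Every step is a deterministic reduction over the same inputs, so re-running produces identical output; the cat1/cat2/cat3 snapshot columns would join on in the same way as `at_snap`.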