You are given two CSVs.
transactions.csv
- Columns: txn_id, user_id, ts_iso (ISO 8601 with timezone offset), amount (decimal USD; refunds are negative), merchant_cat, type {purchase, refund, chargeback}, updated_at (last write wins), dup_hint (a string that can be identical across near-duplicate rows)
- Sample rows:
1, U1, 2024-03-31T23:55:00-0700, 120.00, groceries, purchase, 2024-04-01T00:02:00Z, A
2, U1, 2024-03-31T23:58:10-0700, -20.00, groceries, refund, 2024-04-01T00:05:00Z, B
3, U2, 2024-04-01T07:01:00+0200, 15.00, digital, purchase, 2024-04-01T05:05:00Z, C
3, U2, 2024-04-01T07:01:00+0200, 15.00, digital, purchase, 2024-04-01T05:06:10Z, C   (duplicate of txn_id 3, with a later updated_at)
4, U3, 2024-04-02T10:00:00Z, 500.00, travel, purchase, 2024-04-02T10:01:00Z, D
5, U3, 2024-06-15T23:59:59-0400, 500.00, travel, chargeback, 2024-06-20T12:00:00Z, E
6, U1, 2024-06-01T00:00:10Z, 0.00, fees, purchase, 2024-06-01T00:00:20Z, F
7, U2, 2024-06-30T23:59:59-0700, 200.00, electronics, purchase, 2024-07-01T08:00:00Z, G
users.csv
- Columns: user_id, signup_ts_iso, country, tz_name
- Sample rows:
U1, 2024-01-15T12:00:00Z, US, America/Los_Angeles
U2, 2024-02-20T09:30:00Z, US, America/New_York
U3, 2023-12-01T00:00:00Z, GB, Europe/London
Task (write idiomatic, production-ready pandas code without groupby.apply or explicit Python loops over rows; assume data can be 100M+ rows):
- Read both files, parse timestamps, and normalize ts_iso to UTC. Deduplicate transactions by txn_id, keeping only the row with the maximum updated_at. Sanity-check the data: drop rows where amount is NaN, type is not one of the three valid values, or ts_iso falls outside [2023-01-01, 2025-12-31].
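A minimal sketch of the ingest step under the spec above. The column names come from the schema; the inline rows are hypothetical stand-ins for the real file:

```python
import io
import pandas as pd

# Inline stand-in for transactions.csv (hypothetical rows for illustration).
raw = io.StringIO(
    "txn_id,user_id,ts_iso,amount,merchant_cat,type,updated_at,dup_hint\n"
    "3,U2,2024-04-01T07:01:00+0200,15.00,digital,purchase,2024-04-01T05:05:00Z,C\n"
    "3,U2,2024-04-01T07:01:00+0200,15.00,digital,purchase,2024-04-01T05:06:10Z,C\n"
    "9,U9,2020-01-01T00:00:00Z,10.00,misc,purchase,2020-01-01T00:00:00Z,X\n"
)

tx = pd.read_csv(
    raw,
    dtype={"user_id": "category", "merchant_cat": "category", "type": "category"},
)
# ts_iso mixes UTC offsets; utc=True yields one tz-aware UTC column.
tx["ts_utc"] = pd.to_datetime(tx["ts_iso"], utc=True)
tx["updated_at"] = pd.to_datetime(tx["updated_at"], utc=True)

# Last-write-wins dedup on txn_id: stable sort + keep="last" is idempotent.
tx = tx.sort_values(["txn_id", "updated_at"]).drop_duplicates("txn_id", keep="last")

# Sanity filters: NaN amount, invalid type, timestamp outside the window.
lo = pd.Timestamp("2023-01-01", tz="UTC")
hi = pd.Timestamp("2026-01-01", tz="UTC")  # exclusive bound covers all of 2025-12-31
tx = tx[
    tx["amount"].notna()
    & tx["type"].isin(["purchase", "refund", "chargeback"])
    & tx["ts_utc"].ge(lo)
    & tx["ts_utc"].lt(hi)
].copy()
```

Reading the real files would swap `raw` for a file path (ideally with `usecols` and the same `dtype` map to keep memory flat at 100M+ rows).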
- Exclude refunds and chargebacks from spend features, but track their presence via a separate flag. Define net_spend as the sum of amounts over type == 'purchase' only.
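One way to keep refund activity as a flag while computing net_spend from purchases only; the tiny frame below is a hypothetical stand-in for the cleaned transactions table:

```python
import pandas as pd

# Tiny hypothetical frame mirroring the cleaned transactions table.
tx = pd.DataFrame({
    "user_id": ["U1", "U1", "U3"],
    "amount": [120.0, -20.0, 500.0],
    "type": pd.Categorical(["purchase", "refund", "chargeback"]),
})

# Flags: purchases feed spend features; refunds/chargebacks only set a flag.
tx["is_purchase"] = tx["type"].eq("purchase")
tx["is_refund_like"] = tx["type"].isin(["refund", "chargeback"])

# net_spend per user: purchase amounts only; other rows contribute 0.
net_spend = tx["amount"].where(tx["is_purchase"], 0.0).groupby(tx["user_id"]).sum()
had_refund = tx.groupby("user_id")["is_refund_like"].any()
```

Using `where(..., 0.0)` instead of a boolean slice keeps every user present in the groupby result even when they have no purchases.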
- Build month-level features per user for the window 2024-03-01 through 2024-07-31 (inclusive; months are calendar months in the user’s tz_name, but aggregate after converting to UTC to avoid DST duplication):
  - active_month (1 if the user has ≥1 purchase in that local month, else 0),
  - monthly_net_spend (sum of positive purchase amounts in that local month),
  - rolling_3m_median_spend, computed over active months only (skip months with active_month = 0; do not fill implicit zeros), aligned to month end.
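A sketch of the local-month bucketing and the active-months-only rolling median. Iterating over the handful of distinct timezones (not over rows) keeps each conversion vectorized; the frame is a hypothetical stand-in for cleaned purchases already joined to users:

```python
import pandas as pd

# Hypothetical cleaned purchases joined to users (ts_utc is tz-aware UTC).
df = pd.DataFrame({
    "user_id": ["U1", "U1", "U2"],
    "ts_utc": pd.to_datetime([
        "2024-04-01T06:55:00Z",  # 23:55 on 2024-03-31 in Los Angeles
        "2024-05-02T12:00:00Z",
        "2024-04-01T05:01:00Z",
    ], utc=True),
    "amount": [120.0, 50.0, 15.0],
    "tz_name": ["America/Los_Angeles", "America/Los_Angeles", "America/New_York"],
})
df = df[df["amount"] > 0]  # monthly_net_spend uses positive purchase amounts

# Bucket rows into the user's *local* calendar month, one timezone at a time.
parts = []
for tz, grp in df.groupby("tz_name", observed=True):
    local = grp["ts_utc"].dt.tz_convert(tz).dt.tz_localize(None)
    parts.append(local.dt.to_period("M"))
df["local_month"] = pd.concat(parts)  # aligns back by index

window = (pd.Period("2024-03", "M"), pd.Period("2024-07", "M"))
df = df[df["local_month"].between(*window)]

monthly = (
    df.groupby(["user_id", "local_month"], observed=True)["amount"]
      .sum().rename("monthly_net_spend").reset_index()
      .sort_values(["user_id", "local_month"])
)
monthly["active_month"] = 1  # rows exist only for months with >=1 purchase

# Inactive months are simply absent, so a plain positional window skips them.
monthly["rolling_3m_median_spend"] = (
    monthly.groupby("user_id")["monthly_net_spend"]
           .rolling(3, min_periods=1).median()
           .reset_index(level=0, drop=True)
)
```

Converting to local wall time before `to_period("M")` assigns each purchase to the user's calendar month, while the aggregation itself still runs once over the concatenated frame, so a DST transition cannot double-count a row.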
- For each user-month, compute the top-3 merchant_cat by monthly_net_spend and emit them as categorical features cat1, cat2, cat3. Rank by larger monthly_net_spend first, breaking ties lexicographically on merchant_cat. If fewer than three categories exist, fill the remaining slots with 'None'.
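The top-3 ranking can be done with a sort, a `cumcount` rank, and a pivot, with no groupby.apply. The per-(user, month, category) table below is a hypothetical stand-in:

```python
import pandas as pd

# Hypothetical per-(user, month, category) spend, already aggregated.
cat_spend = pd.DataFrame({
    "user_id": ["U1"] * 4,
    "month": [pd.Period("2024-07", "M")] * 4,
    "merchant_cat": ["groceries", "travel", "digital", "fees"],
    "monthly_net_spend": [120.0, 500.0, 120.0, 5.0],
})

# Rank within user-month: larger spend first, ties broken lexicographically.
ranked = cat_spend.sort_values(
    ["user_id", "month", "monthly_net_spend", "merchant_cat"],
    ascending=[True, True, False, True],
)
ranked["rank"] = ranked.groupby(["user_id", "month"], observed=True).cumcount() + 1

# Pivot ranks 1-3 into cat1..cat3; missing slots become the string 'None'.
wide = (
    ranked[ranked["rank"] <= 3]
    .pivot(index=["user_id", "month"], columns="rank", values="merchant_cat")
    .reindex(columns=[1, 2, 3])
    .fillna("None")
    .rename(columns={1: "cat1", 2: "cat2", 3: "cat3"})
)
```

In the sample, "digital" and "groceries" tie at 120.00, so the lexicographic tie-break places "digital" ahead; the `reindex(columns=[1, 2, 3])` guarantees all three slots exist even when every user-month has fewer categories.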
- Output one row per user for the snapshot date 2024-07-31 containing: user_id, months_active_last_5m, total_net_spend_last_5m, had_any_refund_last_5m (boolean derived from refunds/chargebacks), rolling_3m_median_spend_at_2024_07, cat1_2024_07, cat2_2024_07, cat3_2024_07. Results must be idempotent if you re-run on the same inputs, and memory-efficient (hint: use categorical dtypes, proper indexing, and avoid exploding intermediate DataFrames). Explain the edge cases you handle (DST boundaries, zero-amount rows, near-duplicate rows sharing a dup_hint).
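Assembling the snapshot row from the user-month feature table might look like the following; the feature table itself is a hypothetical stand-in (column names per the spec):

```python
import pandas as pd

# Hypothetical user-month feature table covering 2024-03..2024-07.
feats = pd.DataFrame({
    "user_id": ["U1", "U1", "U2"],
    "month": pd.PeriodIndex(["2024-03", "2024-07", "2024-06"], freq="M"),
    "monthly_net_spend": [100.0, 40.0, 200.0],
    "had_refund": [True, False, False],
    "rolling_3m_median_spend": [100.0, 70.0, 200.0],
})

snap = pd.Period("2024-07", "M")
window = feats[feats["month"].between(pd.Period("2024-03", "M"), snap)]

# Pure reductions per user over the 5-month window.
out = window.groupby("user_id").agg(
    months_active_last_5m=("month", "nunique"),
    total_net_spend_last_5m=("monthly_net_spend", "sum"),
    had_any_refund_last_5m=("had_refund", "any"),
)

# Rolling median *at* the snapshot month; NaN if inactive in 2024-07.
at_snap = (
    window.loc[window["month"].eq(snap)]
          .set_index("user_id")["rolling_3m_median_spend"]
          .rename("rolling_3m_median_spend_at_2024_07")
)
out = out.join(at_snap)
```

Every step is a deterministic reduction over the same inputs, so re-running produces identical output; the cat1/cat2/cat3 snapshot columns would join on in the same way as `at_snap`.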