Transform event logs with subscription windows in pandas
Company: Amazon
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
Using pandas, compute user-level subscription-aligned revenue and anomalies for September 2025.
DataFrames:
events(user_id:int, ts:UTC datetime, event:str in {'view','add_to_cart','purchase'}, product_id:int, price_usd:float)
subs(user_id:int, plan:str, start_ts:UTC datetime, end_ts:UTC datetime or NaT)
Requirements:
(1) For each user, compute active_subscription_days in 2025-09 and total purchase revenue that occurred while the user was actively subscribed (purchase ts ∈ [start_ts, end_ts)).
(2) Flag purchases outside any active window.
(3) If a user has overlapping or back-to-back subscriptions, merge them into minimal disjoint half-open intervals before attribution.
(4) Output two DataFrames: user_month_agg(user_id, month='2025-09', active_subscription_days:int, subscribed_purchase_revenue:float, out_of_window_purchases:int) and anomalies(user_id, ts, price_usd, reason='outside_window'|'overlap_fixed').
(5) Solve with vectorized operations (e.g., IntervalIndex, merge_asof, or interval trees) and discuss scalability to 100M events with limited RAM (chunking, dtype optimization, parquet scans).
Small sample:
subs:
(1,'pro','2025-08-28T00:00Z','2025-09-10T00:00Z')
(1,'pro','2025-09-10T00:00Z','2025-10-10T00:00Z')
(2,'basic','2025-09-05T12:00Z',NaT)
events:
(1,'2025-09-09T22:00Z','purchase',101,19.99)
(1,'2025-09-15T03:00Z','purchase',102,5.00)
(2,'2025-09-01T01:00Z','purchase',103,9.99)
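One possible way to handle requirement (3) and the day count in requirement (1) on the sample above is sketched below. It is an illustration under stated assumptions, not a reference solution: the helper names `merge_windows` and `active_days`, the `FAR_FUTURE` sentinel used for open-ended subscriptions (end_ts = NaT), and the reading of active_subscription_days as the total in-month duration floored to whole days are all choices made here, since the question does not fix them.

```python
import pandas as pd

MONTH_START = pd.Timestamp("2025-09-01", tz="UTC")
MONTH_END = pd.Timestamp("2025-10-01", tz="UTC")    # exclusive upper bound
FAR_FUTURE = pd.Timestamp("2100-01-01", tz="UTC")   # assumed sentinel for end_ts = NaT

subs = pd.DataFrame({
    "user_id": [1, 1, 2],
    "plan": ["pro", "pro", "basic"],
    "start_ts": pd.to_datetime(
        ["2025-08-28T00:00Z", "2025-09-10T00:00Z", "2025-09-05T12:00Z"], utc=True),
    "end_ts": pd.to_datetime(
        ["2025-09-10T00:00Z", "2025-10-10T00:00Z", None], utc=True),
})

def merge_windows(subs: pd.DataFrame) -> pd.DataFrame:
    """Collapse overlapping or touching [start_ts, end_ts) rows per user."""
    s = subs[["user_id", "start_ts", "end_ts"]].copy()
    s["end_ts"] = s["end_ts"].fillna(FAR_FUTURE)
    s = s.sort_values(["user_id", "start_ts", "end_ts"]).reset_index(drop=True)
    # Integer views are used only for comparisons, which avoids NaN/float issues.
    start_i = s["start_ts"].astype("int64")
    end_i = s["end_ts"].astype("int64")
    run_max = end_i.groupby(s["user_id"]).cummax()          # furthest end seen so far
    first_row = s["user_id"].ne(s["user_id"].shift())       # first row of each user
    # A new block starts when a row begins strictly after everything before it ends;
    # start == previous end (back-to-back) stays in the same block.
    new_block = first_row | (start_i > run_max.shift(fill_value=0))
    s["block"] = new_block.cumsum()
    return s.groupby(["user_id", "block"], as_index=False).agg(
        start_ts=("start_ts", "min"),
        end_ts=("end_ts", "max"),
        n_source=("start_ts", "count"),   # > 1 means rows were merged
    )

def active_days(merged: pd.DataFrame) -> pd.DataFrame:
    """Clip each merged window to the month and sum whole days per user."""
    lo = merged["start_ts"].where(merged["start_ts"] > MONTH_START, MONTH_START)
    hi = merged["end_ts"].where(merged["end_ts"] < MONTH_END, MONTH_END)
    dur = hi - lo
    dur = dur.where(dur > pd.Timedelta(0), pd.Timedelta(0))  # windows outside the month
    days = dur.groupby(merged["user_id"]).sum() // pd.Timedelta(days=1)
    return days.rename("active_subscription_days").reset_index()

merged = merge_windows(subs)
days_df = active_days(merged)
print(merged)
print(days_df)
```

On the sample, user 1's two back-to-back rows collapse into a single window from 2025-08-28 to 2025-10-10 (n_source = 2), which clips to 30 days in September; user 2's open-ended window clips to 25.5 days and floors to 25. If the interviewer instead wants a count of distinct calendar days touched, the last step of `active_days` would need to change.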
Quick Answer: This question evaluates a candidate's competency in time-based event attribution and temporal interval manipulation using pandas, covering interval merging, subscription-aligned revenue aggregation, anomaly flagging, and handling edge cases like overlapping or back-to-back subscriptions.
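The attribution step can be sketched with `pd.merge_asof`, assuming the subscription windows have already been reduced to disjoint intervals per user. The `windows` frame below is written out by hand as that assumed merged result for the sample data, and the column names `subscribed_rev` and `outside` are illustrative. Because the windows are disjoint, the latest window starting at or before a purchase is the only one that can contain it, so a backward as-of match followed by an end_ts check is sufficient.

```python
import pandas as pd

MONTH_START = pd.Timestamp("2025-09-01", tz="UTC")
MONTH_END = pd.Timestamp("2025-10-01", tz="UTC")  # exclusive upper bound

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(
        ["2025-09-09T22:00Z", "2025-09-15T03:00Z", "2025-09-01T01:00Z"], utc=True),
    "event": ["purchase", "purchase", "purchase"],
    "product_id": [101, 102, 103],
    "price_usd": [19.99, 5.00, 9.99],
})

# Assumed input: windows already merged into disjoint half-open intervals.
windows = pd.DataFrame({
    "user_id": [1, 2],
    "start_ts": pd.to_datetime(["2025-08-28T00:00Z", "2025-09-05T12:00Z"], utc=True),
    "end_ts": pd.to_datetime(["2025-10-10T00:00Z", None], utc=True),  # NaT = open-ended
})

# Keep only September purchases; merge_asof requires both sides sorted by key.
purchases = events[
    (events["event"] == "purchase")
    & (events["ts"] >= MONTH_START)
    & (events["ts"] < MONTH_END)
].sort_values("ts")

matched = pd.merge_asof(
    purchases,
    windows.sort_values("start_ts"),
    left_on="ts",
    right_on="start_ts",
    by="user_id",
    direction="backward",   # latest window with start_ts <= ts
)

# Inside a window if one was matched and ts is before its end (or it is open-ended).
in_window = matched["start_ts"].notna() & (
    matched["end_ts"].isna() | (matched["ts"] < matched["end_ts"])
)
matched["subscribed_rev"] = matched["price_usd"].where(in_window, 0.0)
matched["outside"] = (~in_window).astype("int64")

user_month_agg = matched.groupby("user_id", as_index=False).agg(
    subscribed_purchase_revenue=("subscribed_rev", "sum"),
    out_of_window_purchases=("outside", "sum"),
)
user_month_agg.insert(1, "month", "2025-09")

anomalies = (
    matched.loc[~in_window, ["user_id", "ts", "price_usd"]]
    .assign(reason="outside_window")
    .reset_index(drop=True)
)
print(user_month_agg)
print(anomalies)
```

This sketch leaves out active_subscription_days, which would be joined in from the window computation, and it only lists users who made a purchase in the month; a left join from the full user list would be needed to include subscribers with no purchases. For the 100M-event case, the same per-chunk logic could be applied to events read in batches, since the merged windows table is small relative to the event log.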