Transform retail data with pandas groupby/merge/concat
Company: Amazon
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Using pandas only (groupby/agg/merge/concat; no for-loops), write code to answer the sub-questions below on the following small dataframes. Assume timestamps are UTC.
users
user_id | signup_date | region | device
U1 | 2025-08-28 | US | iOS
U2 | 2025-08-30 | US | Web
U3 | 2025-09-01 | CA | Android
U4 | 2025-09-02 | US | Web
orders
order_id | user_id | ts | amount | category
O1 | U1 | 2025-09-01 10:05 | 20.00 | Books
O2 | U2 | 2025-09-02 09:10 | 35.00 | Home
O3 | U1 | 2025-09-02 11:00 | 15.00 | Books
O4 | U3 | 2025-09-03 12:00 | 50.00 | Games
O5 | U2 | 2025-09-03 13:30 | 60.00 | Home
events
user_id | ts | event
U1 | 2025-09-01 09:00 | view
U1 | 2025-09-01 09:05 | add_to_cart
U2 | 2025-09-02 09:00 | view
U3 | 2025-09-03 11:50 | view
U3 | 2025-09-03 11:55 | view
U4 | 2025-09-03 15:00 | view
Tasks:
1) Daily Active Users (DAU) by device: compute DAU per calendar day using events, then compute a 2-day rolling unique user count per device (aligned to day end). Explain how you ensure uniqueness across days.
2) First order: left-join users to first order per user to produce first_order_ts and days_to_first_order (float days), with NaN for users without orders. Be careful about users who have same-day signup and order.
3) Top categories per user: compute each user’s top-2 categories by total spend (ties broken alphabetically), returning columns top_cat_1 and top_cat_2; users with <2 categories should have NaN for missing values.
4) Add an "ALL" summary row that aggregates overall revenue by day across all devices and concatenate it to the per-device daily revenue table (schema: day, device, revenue). Ensure consistent column types after concat.
For each step, provide pandas code, resulting schema, and explain time/memory complexity and edge cases (empty joins, duplicate events, differing time zones).
Quick Answer: This question evaluates pandas-based data manipulation skills, specifically proficiency with groupby/agg, merge/concat, rolling-window and time-series aggregations, deduplication and tie-breaking logic for ranking and joins.