Transform retail data with pandas groupby/merge/concat

Q: Transform retail data with pandas groupby/merge/concat

This question evaluates pandas-based data manipulation skills, specifically proficiency with groupby/agg, merge/concat, rolling-window and time-series aggregations, deduplication and tie-breaking logic for ranking and joins.

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Amazon.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Amazon during technical interviews.

Question

Using pandas only (groupby/agg/merge/concat; no for-loops), write code to answer the sub-questions below on the following small dataframes. Assume timestamps are UTC.

users
user_id | signup_date | region | device
U1 | 2025-08-28 | US | iOS
U2 | 2025-08-30 | US | Web
U3 | 2025-09-01 | CA | Android
U4 | 2025-09-02 | US | Web

orders
order_id | user_id | ts | amount | category
O1 | U1 | 2025-09-01 10:05 | 20.00 | Books
O2 | U2 | 2025-09-02 09:10 | 35.00 | Home
O3 | U1 | 2025-09-02 11:00 | 15.00 | Books
O4 | U3 | 2025-09-03 12:00 | 50.00 | Games
O5 | U2 | 2025-09-03 13:30 | 60.00 | Home

events
user_id | ts | event
U1 | 2025-09-01 09:00 | view
U1 | 2025-09-01 09:05 | add_to_cart
U2 | 2025-09-02 09:00 | view
U3 | 2025-09-03 11:50 | view
U3 | 2025-09-03 11:55 | view
U4 | 2025-09-03 15:00 | view

Tasks:
1) Daily Active Users (DAU) by device: compute DAU per calendar day using events, then compute a 2-day rolling unique user count per device (aligned to day end). Explain how you ensure uniqueness across days.
2) First order: left-join users to first order per user to produce first_order_ts and days_to_first_order (float days), with NaN for users without orders. Be careful about users who have same-day signup and order.
3) Top categories per user: compute each user’s top-2 categories by total spend (ties broken alphabetically), returning columns top_cat_1 and top_cat_2; users with <2 categories should have NaN for missing values.
4) Add an "ALL" summary row that aggregates overall revenue by day across all devices and concatenate it to the per-device daily revenue table (schema: day, device, revenue). Ensure consistent column types after concat.
For each step, provide pandas code, resulting schema, and explain time/memory complexity and edge cases (empty joins, duplicate events, differing time zones).

PracHub · Accepted Answer

This question evaluates pandas-based data manipulation skills, specifically proficiency with groupby/agg, merge/concat, rolling-window and time-series aggregations, deduplication and tie-breaking logic for ranking and joins.

Quick Overview

Quick Overview