Write SQL and pandas for shopping events
Company: Pinterest
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Use the schema and sample data below to answer SQL and pandas tasks. Treat 'today' as 2025-09-01.
Schema
users(user_id INT, country STRING)
pins(pin_id INT, creator_user_id INT, is_shopping_enabled BOOLEAN, created_at DATE)
events(event_id INT, user_id INT, pin_id INT, event_type STRING, event_ts TIMESTAMP, stay_time_sec INT, feature STRING, product_category_code STRING)
product_categories(code STRING, name STRING)
Sample tables (small)
users
user_id | country
1 | US
2 | US
3 | CA
pins
pin_id | creator_user_id | is_shopping_enabled | created_at
10 | 3 | true | 2025-08-20
11 | 2 | false | 2025-08-22
12 | 1 | true | 2025-08-25
events
event_id | user_id | pin_id | event_type | event_ts | stay_time_sec | feature | product_category_code
101 | 1 | 10 | shopping_click | 2025-08-30 10:00:00 | 35 | shopping_module | A
102 | 1 | 10 | shopping_click | 2025-09-01 09:00:00 | 50 | shopping_module | null
103 | 2 | 12 | shopping_click | 2025-08-31 12:00:00 | null | shopping_module | H
104 | 2 | 11 | view | 2025-09-01 08:00:00 | null | feed | null
105 | 3 | 10 | shopping_click | 2025-08-28 07:00:00 | 15 | shopping_module | A
106 | 3 | 10 | shopping_click | 2025-09-01 11:59:00 | -5 | shopping_module | ?
product_categories
code | name
A | Apparel
H | Home
? | Unknown
Tasks
SQL
1) Daily shopping engagement last 7 days: Write a single SQL query that returns, for each date d in [2025-08-26, 2025-09-01], the columns (d, dau_shopping, clicks, avg_stay_time_pos_sec, rolling_7d_uniq_users). Count only events where event_type='shopping_click' and feature='shopping_module'. Deduplicate rapid repeat clicks per (user_id, pin_id) if they occur within 5 minutes (treat as one); implement dedup with window functions. Exclude stay_time_sec <= 0 or NULL from the average, but still count those clicks in 'clicks'. Ensure dates with no activity appear with zeros using a generated dates CTE.
2) Next-day retention: Among users with at least one shopping_click on 2025-08-31, compute the percentage that also have a shopping_click on 2025-09-01.
3) Top pins per user: For the 7 days ending 2025-09-01, return for each user their top 2 pins by number of deduplicated shopping_clicks; break ties by greater total positive stay_time_sec, then by smallest pin_id.
Python (pandas)
Given a DataFrame events_df with the same columns as events: (a) Map product_category_code using dict = {'A':'Apparel','H':'Home','?':'Unknown'} so missing/unknown codes become 'Unknown'. (b) Replace negative stay_time_sec with NaN; fill remaining NaN stay_time_sec with 0 for aggregation but exclude zeros from averages where appropriate. (c) Sort events_df by ['user_id' asc, 'event_ts' desc, 'stay_time_sec' desc]. (d) Compute, for the last 7 days ending 2025-09-01, each user's top category by total positive stay_time_sec and return a Series user_id -> top_category (tie-breaker: alphabetical).
Quick Answer: This question evaluates a candidate's ability to perform data manipulation in SQL and pandas, covering event deduplication, time-window aggregations, next-day retention calculations, top-N ranking, joins, and data cleaning/mapping for analytics.