Write SQL/Python for messy event data
Company: Google
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Using the schema and sample data below, write: (1) a single SQL query to compute daily metrics for the local date 2025-09-01 in America/Los_Angeles, and (2) a Python (pandas) transformation.
Goal A (SQL): For 2025-09-01 (America/Los_Angeles local date derived from UTC timestamps), output one row with:
- new_buyers: users whose first paid order occurred on that local date, excluding paid orders refunded within 24 hours of order_time_utc and all canceled orders.
- cart_to_paid_new: among users counted as new_buyers, the share who had at least one add_to_cart event in the 7 local days prior to their first paid order; deduplicate add_to_cart events within the same session_id by collapsing events that are ≤10 minutes apart into one.
- cart_to_paid_returning: same conversion on that local date for users who placed a paid order that day but whose first paid order was before 2025-09-01.
- srm_p_value: a sample-ratio-mismatch p-value for the treatment vs control split among users observed on 2025-09-01 (based on experiments.variant), assuming 50/50 expected split.
Assumptions: timestamps are stored in UTC; convert to America/Los_Angeles for local dates; if multiple paid orders exist on the first-paid day, treat the earliest qualifying paid order as first.
Goal B (Python/pandas): Given a DataFrame events with the columns below (including a dict-like props_json), produce a DataFrame with columns [user_id, local_dt, sku, dedup_add_to_cart_cnt] where local_dt is the America/Los_Angeles date of event_time_utc; deduplicate add_to_cart within each [user_id, session_id, sku] by treating any subsequent add_to_cart within 2 minutes as the same action; count distinct deduped add_to_cart per [user_id, local_dt, sku]. Show idiomatic pandas code (no UDFs) and explain time zone handling.
Schema and small ASCII samples:
users
---------
user_id | signup_utc | referrer
1 | 2025-08-20 13:02:10 | ads
2 | 2025-08-28 21:50:05 | seo
3 | 2025-08-31 02:11:34 | direct
4 | 2025-09-01 04:00:12 | ads
5 | 2025-09-01 05:47:40 | partner
events
---------------------------------------------------------------------------------------------
user_id | event_time_utc | event_type | session_id | device | props_json
1 | 2025-08-30 16:00:00 | add_to_cart | s1 | ios | {"sku":"A1","qty":1}
1 | 2025-08-30 16:01:00 | add_to_cart | s1 | ios | {"sku":"A1","qty":1}
1 | 2025-08-30 16:05:00 | purchase_click | s1 | ios | {}
2 | 2025-09-01 01:02:03 | add_to_cart | s2 | web | {"sku":"B2","qty":2}
3 | 2025-08-25 09:10:00 | add_to_cart | s3 | android | {"sku":"C3","qty":1}
orders
--------------------------------------------------------------------------------------
order_id | user_id | order_time_utc | amount_usd | status | refund_time_utc
10 | 1 | 2025-09-01 16:15:00 | 120.00 | paid | 2025-09-01 18:00:00
11 | 1 | 2025-09-01 16:20:00 | 35.00 | canceled | null
12 | 2 | 2025-09-01 02:15:00 | 50.00 | paid | null
13 | 3 | 2025-08-26 11:00:00 | 20.00 | paid | 2025-08-27 08:00:00
14 | 4 | 2025-09-02 07:00:00 | 15.00 | paid | null
experiments
-----------------------------
user_id | dt | variant
1 | 2025-09-01 | treatment
2 | 2025-09-01 | control
3 | 2025-09-01 | treatment
4 | 2025-09-02 | control
5 | 2025-09-01 | treatment
Quick Answer: This question evaluates SQL and pandas proficiency for event-level data manipulation, including time zone–aware local date conversion, session-based deduplication, cohort and conversion metric calculation, and basic experiment sanity checking via a sample-ratio-mismatch p-value.