Write SQL for snapshot features and labels

Q: Write SQL for snapshot features and labels

This is a Data Manipulation (SQL/Python) interview question from Stripe for Data Scientist roles. View the full question and solution on PracHub.

Q: How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

Question

Use 'today' = 2025-09-01. Invented schema (UTC):

users(u_id INT, signup_dt DATE, country STRING) events(u_id INT, event_time TIMESTAMP, arrival_time TIMESTAMP, event_type STRING, session_id STRING) orders(order_id INT, u_id INT, order_time TIMESTAMP, amount DECIMAL(10,2))

Sample rows: users u_id | signup_dt | country 1 | 2025-07-20 | US 2 | 2025-08-25 | CA 3 | 2025-08-30 | US

events u_id | event_time | arrival_time | event_type | session_id 1 | 2025-08-29 10:00:00 | 2025-08-29 10:00:02 | view | a 1 | 2025-08-31 09:00:00 | 2025-08-31 09:00:03 | add_to_cart | a 1 | 2025-09-01 12:00:00 | 2025-09-01 12:00:01 | view | b 2 | 2025-08-30 15:00:00 | 2025-08-30 15:00:02 | view | c 2 | 2025-09-01 09:30:00 | 2025-09-01 09:30:01 | view | d 3 | 2025-08-31 23:59:00 | 2025-09-02 00:00:05 | view | e (late-arriving) 3 | 2025-09-01 08:00:00 | 2025-09-01 08:00:00 | view | e 3 | 2025-09-01 08:00:00 | 2025-09-01 08:00:00 | view | e (duplicate)

orders order_id | u_id | order_time | amount 101 | 1 | 2025-09-10 13:00:00 | 120.00 102 | 2 | 2025-08-31 16:00:00 | 50.00 103 | 2 | 2025-09-03 11:00:00 | 60.00

Task: Write one SQL query that produces a training dataset at anchor snapshot_ts = '2025-09-01 00:00:00' with columns: (u_id, country, days_since_signup, last_event_time, events_7d, sessions_7d, added_to_cart_7d, y_30d). Rules: (A) Features use only events with event_time < snapshot_ts AND arrival_time <= snapshot_ts; drop duplicates by (u_id, event_time, event_type, session_id). (B) Compute 7-day windows over [snapshot_ts - 7 days, snapshot_ts). (C) sessions_7d counts distinct session_id in window; events_7d counts distinct (event_time, event_type, session_id). (D) y_30d = 1 if there exists an order in [snapshot_ts, snapshot_ts + INTERVAL '30' DAY), else 0. (E) Include users with no events (fill nulls/zeros appropriately). Show the exact row outputs for the provided sample data and explain why user 3's late-arriving 2025-08-31 event must be excluded from features.

Write SQL for snapshot features and labels

Comments (0)