Build panel in SQL; run causal regression
Company: Airbnb
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Assume today is 2025-09-01 (UTC). Schema and small samples:
users(user_id INT, country STRING, signup_date DATE, platform STRING)
Sample:
user_id | country | signup_date | platform
1 | US | 2025-08-25 | ios
2 | US | 2025-08-20 | android
3 | CA | 2025-07-10 | web
4 | US | 2025-08-31 | ios
5 | GB | 2025-06-15 | android
exposures(user_id INT, ts TIMESTAMP, treatment INT)
Sample:
user_id | ts | treatment
1 | 2025-08-30T10:00:00Z | 1
1 | 2025-09-01T09:00:00Z | 1
2 | 2025-08-29T12:00:00Z | 0
3 | 2025-08-28T08:00:00Z | 1
5 | 2025-08-26T14:00:00Z | 0
orders(order_id INT, user_id INT, ts TIMESTAMP, amount DECIMAL(10,2))
Sample:
order_id | user_id | ts | amount
10 | 1 | 2025-08-30T11:00:00Z | 25.00
11 | 2 | 2025-08-29T13:00:00Z | 9.99
12 | 3 | 2025-08-31T09:00:00Z | 12.00
13 | 1 | 2025-09-01T10:15:00Z | 5.00
14 | 4 | 2025-09-01T16:00:00Z | 20.00
geo_rollout(country STRING, launch_ts TIMESTAMP)
Sample:
country | launch_ts
US | 2025-08-29T00:00:00Z
CA | 2025-08-27T00:00:00Z
GB | 2025-08-31T00:00:00Z
Tasks:
SQL A: Build a user-day panel for dates d in [2025-08-25, 2025-09-01]. For each user_id and date d, output: date, user_id, country, treated (1 if EXISTS exposure with ts <= d 23:59:59Z and treatment=1), post (1 if date(d) >= date(launch_ts for user’s country)), revenue_d (sum of order amounts with ts on d), active_d (1 if revenue_d > 0 OR EXISTS exposure on d), and signup_age_days. Ensure no duplicate user-days and fill missing user-days with zeros where appropriate.
SQL B: Using the panel, compute for US and GB the 7-day rolling revenue per active user on each day d in [2025-08-26, 2025-09-01], where active user means active_d=1 within the 7-day window. Clarify how you handle users with zero activity and avoid lookahead bias.
Python: From the panel, estimate a two-way fixed-effects DiD: revenue_d ~ post * treated + user FE + day FE, clustering SEs at the user level. (1) Interpret the interaction coefficient economically; (2) test for pre-trends using leads; (3) address late adopters/immortal time bias; (4) re-estimate with a count model (Poisson with exposure offset) and compare; (5) report a 95% CI and a practical recommendation.
Quick Answer: This question evaluates proficiency in SQL-based panel data construction, time-series aggregation and rolling-window metrics together with applied causal inference in Python, including two-way fixed-effects difference-in-differences, clustered standard errors, lead tests, and count-model re-specification; it targets Data Manipulation (SQL/Python) skills relevant for Data Scientist roles. It is commonly asked because it combines practical implementation and conceptual understanding—testing temporal joins, treatment timing, avoidance of lookahead/immortal-time biases and late-adopter issues, interpretation of interaction coefficients and confidence intervals, and the ability to reason about identification and model choice.