Compute C/T metrics from bookings and visits
Company: Airbnb
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Given two tables, compute control vs treatment (C/T) metrics, apply 24‑hour attribution, and generate a daily plot. Treat “today” as 2025‑09‑01; use the last 7 days window 2025‑08‑26 through 2025‑09‑01 (inclusive). Provide either SQL or Python, and explain any assumptions.
Schema:
- visits(visit_id STRING, user_id STRING, visit_ts TIMESTAMP UTC, market_id STRING, channel STRING, variant CHAR(1) in {"C","T"})
- bookings(booking_id STRING, user_id STRING, booking_ts TIMESTAMP UTC, status STRING in {"confirmed","cancelled"}, revenue_usd DECIMAL(10,2))
Attribution rule: attribute each booking to the most recent visit by the same user within the prior 24 hours; otherwise leave unattributed. Deduplicate visits within 60 seconds per user_id by keeping only the latest. Exclude cancelled bookings from conversion and revenue; keep a diagnostic count of cancellations. Ignore bookings without an attributed visit for conversion, but report their count separately.
Required outputs for 2025‑08‑26..2025‑09‑01 by variant and by day:
- visits, unique_users, attributed_bookings (confirmed), conversion_rate = attributed_bookings / visits, total_revenue_usd (confirmed only), revenue_per_visit, cancellations, unattributed_confirmed_bookings.
- A line plot of daily conversion_rate with 95% Wilson CIs for C and T.
Edge cases to handle: a user with C and T visits in the window; multiple bookings after one visit; booking outside the 24‑hour window; duplicate visits within 60 seconds.
Small sample data (ASCII) to clarify shapes (not exhaustive):
visits
+----------+---------+---------------------+-----------+---------+---------+
| visit_id | user_id | visit_ts | market_id | channel | variant |
+----------+---------+---------------------+-----------+---------+---------+
| v1 | u101 | 2025-08-26 09:05:00 | NYC | seo | C |
| v2 | u101 | 2025-08-26 20:10:00 | NYC | direct | T |
| v3 | u102 | 2025-08-27 10:00:00 | SFO | paid | C |
| v4 | u103 | 2025-08-27 23:50:00 | NYC | email | T |
| v5 | u104 | 2025-08-28 00:10:00 | NYC | seo | T |
| v6 | u105 | 2025-08-28 12:40:00 | BOS | seo | C |
| v7 | u106 | 2025-08-30 21:55:00 | NYC | direct | T |
| v8 | u106 | 2025-08-30 21:56:00 | NYC | direct | T |
| v9 | u107 | 2025-09-01 09:30:00 | BOS | paid | C |
+----------+---------+---------------------+-----------+---------+---------+
bookings
+-------------+---------+---------------------+-----------+-------------+
| booking_id | user_id | booking_ts | status | revenue_usd |
+-------------+---------+---------------------+-----------+-------------+
| b1 | u101 | 2025-08-26 21:00:00 | confirmed | 220.00 |
| b2 | u102 | 2025-08-27 12:00:00 | confirmed | 150.00 |
| b3 | u103 | 2025-08-28 00:30:00 | confirmed | 300.00 |
| b4 | u104 | 2025-08-29 04:00:00 | confirmed | 180.00 |
| b5 | u106 | 2025-08-31 22:10:00 | confirmed | 250.00 |
| b6 | u105 | 2025-08-30 13:00:00 | cancelled | 200.00 |
| b7 | u108 | 2025-08-29 14:00:00 | confirmed | 120.00 |
+-------------+---------+---------------------+-----------+-------------+
Tasks:
A) Recompute attribution from visits using the 24‑hour rule and deduped visits; produce the required metrics by variant and by day.
B) Output a tidy table with one row per (date, variant) and the metrics above.
C) Produce the described plot.
Quick Answer: This question evaluates time-window attribution, visit deduplication, control vs treatment metric computation, aggregation of conversions and revenue, and statistical interval estimation (Wilson CIs), testing practical data manipulation skills with SQL or Python in the Data Manipulation (SQL/Python) domain.