Join datasets and compute conversion by assignment
Company: Meta
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
You are given two CSVs. Create tables and write SQL to produce both visit-level and visitor-level conversion datasets, then aggregate conversion by assignment and country. Use the following schema and sample data.
Schema:
- visit(id_visitor BIGINT, ts TIMESTAMP, country STRING, assign TINYINT)
- booking(id_booking BIGINT, id_visitor BIGINT, ts TIMESTAMP)
Sample tables (timestamps are UTC):
visit
+------------+---------------------+---------+--------+
| id_visitor | ts                  | country | assign |
+------------+---------------------+---------+--------+
| 101        | 2025-01-03 09:12:00 | US      | 1      |
| 101        | 2025-01-05 10:00:00 | US      | 1      |
| 102        | 2025-01-04 14:30:00 | CA      | 0      |
| 103        | 2025-01-04 15:00:00 | US      | 1      |
| 104        | 2025-01-06 08:00:00 | GB      | 0      |
| 105        | 2025-01-06 09:10:00 | US      | 0      |
+------------+---------------------+---------+--------+
booking
+------------+------------+---------------------+
| id_booking | id_visitor | ts                  |
+------------+------------+---------------------+
| 5001       | 101        | 2025-01-05 12:00:00 |
| 5002       | 102        | 2025-01-04 16:00:00 |
| 5003       | 103        | 2025-01-10 09:00:00 |
| 5004       | 101        | 2025-01-03 08:00:00 |
| 5005       | 105        | 2025-02-01 10:00:00 |
+------------+------------+---------------------+
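As a starting point, the sample tables can be loaded into an in-memory database and checked for duplicates before any joins are written. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for the warehouse. Storing timestamps as ISO-8601 UTC text and loading from literals rather than CSV files are assumptions of this sketch, not part of the question.

```python
import sqlite3

# Sample rows copied from the tables above; timestamps are treated as UTC.
VISITS = [
    (101, "2025-01-03 09:12:00", "US", 1),
    (101, "2025-01-05 10:00:00", "US", 1),
    (102, "2025-01-04 14:30:00", "CA", 0),
    (103, "2025-01-04 15:00:00", "US", 1),
    (104, "2025-01-06 08:00:00", "GB", 0),
    (105, "2025-01-06 09:10:00", "US", 0),
]
BOOKINGS = [
    (5001, 101, "2025-01-05 12:00:00"),
    (5002, 102, "2025-01-04 16:00:00"),
    (5003, 103, "2025-01-10 09:00:00"),
    (5004, 101, "2025-01-03 08:00:00"),
    (5005, 105, "2025-02-01 10:00:00"),
]

conn = sqlite3.connect(":memory:")
# Assumption: ts is stored as TEXT because SQLite has no native TIMESTAMP type;
# ISO-8601 strings still sort chronologically.
conn.executescript("""
    CREATE TABLE visit (id_visitor INTEGER, ts TEXT, country TEXT, assign INTEGER);
    CREATE TABLE booking (id_booking INTEGER, id_visitor INTEGER, ts TEXT);
""")
conn.executemany("INSERT INTO visit VALUES (?, ?, ?, ?)", VISITS)
conn.executemany("INSERT INTO booking VALUES (?, ?, ?)", BOOKINGS)

visit_rows = conn.execute("SELECT COUNT(*) FROM visit").fetchone()[0]
booking_rows = conn.execute("SELECT COUNT(*) FROM booking").fetchone()[0]

# Count (id_visitor, ts) pairs that appear more than once in visit.
dup_visits = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT id_visitor, ts FROM visit
        GROUP BY id_visitor, ts HAVING COUNT(*) > 1
    ) AS d
""").fetchone()[0]

# Count booking ids that appear more than once in booking.
dup_bookings = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT id_booking FROM booking
        GROUP BY id_booking HAVING COUNT(*) > 1
    ) AS d
""").fetchone()[0]

print(visit_rows, booking_rows, dup_visits, dup_bookings)
```

The sample data contains no duplicate rows, but running the same checks on the real CSVs shows whether a deduplication step is needed before computing conversion.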
Requirements:
1) Visit-level dataset: one row per visit with columns (id_visitor, visit_ts, country, assign, booked_flag). booked_flag=1 if the same id_visitor has a booking with visit.ts <= booking.ts < min(next_visit.ts, visit.ts + INTERVAL 28 DAY), where next_visit.ts is that visitor's next visit (if there is none, only the 28-day bound applies); otherwise 0. Ensure a single booking is not double-counted across multiple visits for the same visitor.
2) Visitor-level dataset: one row per visitor with columns (id_visitor, first_visit_ts, country_at_first_visit, assign_at_first_visit, booked_flag_28d). booked_flag_28d=1 if any booking.ts is in [first_visit_ts, first_visit_ts + 28 days); otherwise 0. If a visitor has conflicting assign values across visits, use the earliest observed assign.
3) Aggregations: for each of the two datasets, output counts by (assign, country): visits_or_visitors, bookers, conversion = bookers / visits_or_visitors. Be explicit about handling duplicates and timezone assumptions.
Provide ANSI SQL (CTEs allowed) that runs on a typical data warehouse (e.g., BigQuery/Snowflake/Postgres) and produces the specified aggregations.
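One possible shape for the visit-level part is sketched below, again run through Python's sqlite3 so it can be tested locally (window functions need SQLite 3.25 or later). The CTE names v, w, and visit_level are illustrative. `datetime(ts, '+28 days')` is SQLite's spelling of `ts + INTERVAL 28 DAY`, and the CASE expression stands in for `LEAST(...)`, so both would need adapting to the target warehouse. Treating exact duplicate visit rows as load artifacts is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit (id_visitor INTEGER, ts TEXT, country TEXT, assign INTEGER);
CREATE TABLE booking (id_booking INTEGER, id_visitor INTEGER, ts TEXT);
INSERT INTO visit VALUES
  (101, '2025-01-03 09:12:00', 'US', 1),
  (101, '2025-01-05 10:00:00', 'US', 1),
  (102, '2025-01-04 14:30:00', 'CA', 0),
  (103, '2025-01-04 15:00:00', 'US', 1),
  (104, '2025-01-06 08:00:00', 'GB', 0),
  (105, '2025-01-06 09:10:00', 'US', 0);
INSERT INTO booking VALUES
  (5001, 101, '2025-01-05 12:00:00'),
  (5002, 102, '2025-01-04 16:00:00'),
  (5003, 103, '2025-01-10 09:00:00'),
  (5004, 101, '2025-01-03 08:00:00'),
  (5005, 105, '2025-02-01 10:00:00');
""")

VISIT_LEVEL = """
WITH v AS (
  -- Assumption: exact duplicate visit rows are load artifacts, so collapse them.
  SELECT DISTINCT id_visitor, ts, country, assign FROM visit
),
w AS (
  SELECT id_visitor, ts AS visit_ts, country, assign,
         datetime(ts, '+28 days') AS cap_ts,  -- 28-day upper bound, in UTC
         LEAD(ts) OVER (PARTITION BY id_visitor ORDER BY ts) AS next_ts
  FROM v
),
visit_level AS (
  SELECT w.id_visitor, w.visit_ts, w.country, w.assign,
         CASE WHEN EXISTS (
           SELECT 1 FROM booking b
           WHERE b.id_visitor = w.id_visitor
             AND b.ts >= w.visit_ts
             -- Window ends at the earlier of next visit and the 28-day cap;
             -- a NULL next_ts falls through to cap_ts.
             AND b.ts < CASE WHEN w.next_ts < w.cap_ts
                             THEN w.next_ts ELSE w.cap_ts END
         ) THEN 1 ELSE 0 END AS booked_flag
  FROM w
)
"""

visit_level = conn.execute(VISIT_LEVEL + """
SELECT id_visitor, visit_ts, country, assign, booked_flag
FROM visit_level
ORDER BY id_visitor, visit_ts
""").fetchall()

visit_agg = conn.execute(VISIT_LEVEL + """
SELECT assign, country,
       COUNT(*) AS visits,
       SUM(booked_flag) AS bookers,
       1.0 * SUM(booked_flag) / COUNT(*) AS conversion
FROM visit_level
GROUP BY assign, country
ORDER BY assign, country
""").fetchall()

for row in visit_agg:
    print(row)
```

Because each visit's window closes at the visitor's next visit, the windows for one visitor never overlap, so a booking can match at most one visit. On the sample data, booking 5004 precedes every visit by visitor 101 and matches none, while booking 5001 is attributed to that visitor's second visit only.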
Quick Answer: This question tests time-based joins, event attribution, deduplication, and conversion-metric computation in the Data Manipulation (SQL/Python) domain.
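The visitor-level dataset and its aggregation could be sketched along the same lines. This is a hedged example using Python's sqlite3 (SQLite 3.25 or later for ROW_NUMBER), not a definitive answer. The CTE names are illustrative. The tie-break `ORDER BY ts, assign, country` for visits sharing the same timestamp is an assumption added for determinism, and `datetime(ts, '+28 days')` is the SQLite form of the 28-day interval.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit (id_visitor INTEGER, ts TEXT, country TEXT, assign INTEGER);
CREATE TABLE booking (id_booking INTEGER, id_visitor INTEGER, ts TEXT);
INSERT INTO visit VALUES
  (101, '2025-01-03 09:12:00', 'US', 1),
  (101, '2025-01-05 10:00:00', 'US', 1),
  (102, '2025-01-04 14:30:00', 'CA', 0),
  (103, '2025-01-04 15:00:00', 'US', 1),
  (104, '2025-01-06 08:00:00', 'GB', 0),
  (105, '2025-01-06 09:10:00', 'US', 0);
INSERT INTO booking VALUES
  (5001, 101, '2025-01-05 12:00:00'),
  (5002, 102, '2025-01-04 16:00:00'),
  (5003, 103, '2025-01-10 09:00:00'),
  (5004, 101, '2025-01-03 08:00:00'),
  (5005, 105, '2025-02-01 10:00:00');
""")

VISITOR_LEVEL = """
WITH v AS (
  -- Assumption: exact duplicate visit rows are load artifacts, so collapse them.
  SELECT DISTINCT id_visitor, ts, country, assign FROM visit
),
ranked AS (
  SELECT id_visitor, ts, country, assign,
         -- rn = 1 marks the earliest visit, which fixes country and assign.
         -- Assumption: ties on ts are broken by assign, then country.
         ROW_NUMBER() OVER (
           PARTITION BY id_visitor ORDER BY ts, assign, country
         ) AS rn
  FROM v
),
visitor_level AS (
  SELECT r.id_visitor,
         r.ts AS first_visit_ts,
         r.country AS country_at_first_visit,
         r.assign AS assign_at_first_visit,
         CASE WHEN EXISTS (
           SELECT 1 FROM booking b
           WHERE b.id_visitor = r.id_visitor
             AND b.ts >= r.ts
             AND b.ts < datetime(r.ts, '+28 days')  -- half-open 28-day window, UTC
         ) THEN 1 ELSE 0 END AS booked_flag_28d
  FROM ranked r
  WHERE r.rn = 1
)
"""

visitor_level = conn.execute(VISITOR_LEVEL + """
SELECT id_visitor, first_visit_ts, country_at_first_visit,
       assign_at_first_visit, booked_flag_28d
FROM visitor_level
ORDER BY id_visitor
""").fetchall()

visitor_agg = conn.execute(VISITOR_LEVEL + """
SELECT assign_at_first_visit AS assign,
       country_at_first_visit AS country,
       COUNT(*) AS visitors,
       SUM(booked_flag_28d) AS bookers,
       1.0 * SUM(booked_flag_28d) / COUNT(*) AS conversion
FROM visitor_level
GROUP BY assign_at_first_visit, country_at_first_visit
ORDER BY assign, country
""").fetchall()

for row in visitor_agg:
    print(row)
```

Using EXISTS rather than a join keeps the output at one row per visitor even when a visitor has several bookings inside the window, so no extra deduplication of bookings is needed for the flag.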