Write SQL for noisy A/B launch metrics
Company: Chime
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
PostgreSQL. Today is 2025-09-01. You’re given the following schema and toy samples:
users(user_id INT, country TEXT, signup_date DATE, marketing_channel TEXT)
+---------+---------+-------------+------------------+
| user_id | country | signup_date | marketing_channel|
+---------+---------+-------------+------------------+
| 1 | US | 2025-08-12 | email |
| 2 | CA | 2025-08-13 | paid |
| 3 | US | 2025-08-11 | organic |
| 4 | GB | 2025-08-17 | paid |
+---------+---------+-------------+------------------+
exposures(user_id INT, experiment_id TEXT, variant TEXT, exposure_ts TIMESTAMP, source TEXT)
+---------+--------------------+---------+---------------------+--------+
| user_id | experiment_id | variant | exposure_ts | source |
+---------+--------------------+---------+---------------------+--------+
| 1 | tracker_vlaunch_v1 | control | 2025-08-12 10:00 | web |
| 2 | tracker_vlaunch_v1 | B | 2025-08-13 09:35 | email |
| 2 | tracker_vlaunch_v1 | control | 2025-08-14 10:18 | web | <- crossover
| 3 | tracker_vlaunch_v1 | B | 2025-08-11 08:02 | paid |
| 4 | tracker_vlaunch_v1 | B | 2025-08-17 15:10 | web |
+---------+--------------------+---------+---------------------+--------+
sessions(user_id INT, session_start TIMESTAMP, device TEXT, is_bot BOOLEAN)
+---------+---------------------+---------+--------+
| user_id | session_start | device | is_bot |
+---------+---------------------+---------+--------+
| 1 | 2025-08-12 10:00 | ios | false |
| 2 | 2025-08-13 09:30 | web | false |
| 2 | 2025-08-14 10:10 | web | true |
| 3 | 2025-08-11 08:00 | android | false |
| 4 | 2025-08-17 15:00 | web | false |
+---------+---------------------+---------+--------+
orders(order_id INT, user_id INT, order_ts TIMESTAMP, revenue NUMERIC(10,2), refunded BOOLEAN)
+----------+---------+---------------------+---------+----------+
| order_id | user_id | order_ts | revenue | refunded |
+----------+---------+---------------------+---------+----------+
| 501 | 1 | 2025-08-20 12:00 | 99.00 | false |
| 502 | 2 | 2025-08-28 11:05 | 59.00 | true |
| 503 | 3 | 2025-08-13 09:00 | 49.00 | false |
| 504 | 4 | 2025-08-30 13:00 | 129.00 | false |
+----------+---------+---------------------+---------+----------+
Task A — core metrics (CTEs allowed): For experiment_id = 'tracker_vlaunch_v1' and exposures between 2025-08-01 and 2025-08-24, write a single query that:
- Assigns each user to the earliest exposure variant and censors any later crossovers.
- Qualifies users as non-bot if they have at least one session with is_bot = false on or before their earliest exposure.
- Computes, per variant, by country in {US, CA} only: users_exposed, qualified_users, converters_14d (users with an order within 14 days of earliest exposure and order_ts < '2025-09-01'), conv_rate_14d, revenue_14d (sum of non-refunded revenue within the same 14-day window), refund_rate_14d (refunded orders / all orders in window).
- Excludes orders whose order_ts falls in the iOS outage window ['2025-08-10', '2025-08-12') if the user’s earliest qualifying device was ios.
Task B — SRM diagnostic: Write a second query that tests for sample ratio mismatch across variants among qualified users using a chi-square goodness-of-fit test vs. expected 50/50. Output observed_count_control, observed_count_B, expected_each, chi2_stat, and an approximate p_value (you may use built-in ln(), exp(), and power() but no UDFs).
Edge cases to handle: users with no sessions; users with only bot sessions; users with orders before exposure; multiple orders within 14 days; users outside US/CA; and the crossover illustrated for user_id=2.
Quick Answer: This question evaluates SQL-based data manipulation and experimental analytics skills, including attribution of earliest exposures, censoring crossovers, session-based bot qualification, time-windowed conversion and revenue calculations, and exclusion of outage periods in PostgreSQL.