Write SQL for post-trial conversion cohorts
Company: OpenAI
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Using the schema below, write SQL to compute, for users first exposed between 2025‑06‑01 and 2025‑06‑30 (inclusive), the intent‑to‑treat paid conversion rate within 60 days of first exposure, by variant (control vs. trial). Count each user once using their first exposure only. Exclude users who had any paid_started_at before their first exposure. Return variant, exposed_users, converters_60d, conversion_rate_60d, and 95% Wald CIs. Then add a second query for the triggered analysis restricted to users who actually started a trial (if variant='trial'). Finally, write a data‑quality query to flag anomalies: users with a trial but no prior exposure; multiple exposures on the same day; exposure timestamps after trial_started_at; or duplicate user_ids.
Schema:
- users(user_id INT, country STRING, created_at TIMESTAMP)
- exposures(user_id INT, exposed_at TIMESTAMP, variant STRING CHECK (variant IN ('control','trial')))
- trials(user_id INT, trial_started_at TIMESTAMP)
- payments(user_id INT, paid_started_at TIMESTAMP, plan STRING, revenue_usd DECIMAL(10,2))
ASCII samples:
users
+---------+---------+---------------------+
| user_id | country | created_at |
+---------+---------+---------------------+
| 1 | US | 2025-05-28 10:00:00 |
| 2 | US | 2025-06-02 09:00:00 |
| 3 | GB | 2025-06-10 14:30:00 |
| 4 | IN | 2025-06-15 21:10:00 |
| 5 | US | 2025-06-20 08:05:00 |
+---------+---------+---------------------+
exposures
+---------+---------------------+---------+
| user_id | exposed_at | variant |
+---------+---------------------+---------+
| 1 | 2025-06-01 12:00:00 | control |
| 2 | 2025-06-03 12:05:00 | trial |
| 2 | 2025-06-04 12:05:00 | trial |
| 3 | 2025-06-11 16:00:00 | trial |
| 4 | 2025-06-15 22:00:00 | control |
+---------+---------------------+---------+
trials
+---------+---------------------+
| user_id | trial_started_at |
+---------+---------------------+
| 2 | 2025-06-03 12:06:00 |
| 3 | 2025-06-12 10:00:00 |
+---------+---------------------+
payments
+---------+---------------------+-------+-------------+
| user_id | paid_started_at | plan | revenue_usd |
+---------+---------------------+-------+-------------+
| 1 | 2025-06-20 08:00:00 | plus | 20.00 |
| 2 | 2025-08-01 09:00:00 | pro | 40.00 |
| 3 | 2025-07-25 11:00:00 | plus | 20.00 |
+---------+---------------------+-------+-------------+
Assume timestamps are UTC; treat 60 days as DATE_DIFF('day', first_exposure, paid_started_at) BETWEEN 0 AND 60. Be careful to: (a) define first_exposure per user, (b) prevent leakage from pre‑exposure payments, and (c) avoid double counting across variants.
Quick Answer: This question evaluates proficiency in SQL-based cohort construction, deduplication and exclusion logic, conversion-rate computation with 95% Wald confidence intervals, and the distinction between intent-to-treat and triggered analyses.