Compute signup rate and retention from raw logs
Company: OpenAI
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: easy
Interview Round: Technical Screen
## Scenario
You are analyzing an A/B test for a marketing campaign offering **a free 1-month trial**.
You are given raw “upstream” tables that resemble production event logs rather than pre-aggregated metrics.
Assume **UTC timestamps** and that each user has **at most one assignment**.
## Tables (upstream)
### `experiment_assignment`
- `user_id` (STRING, PK)
- `experiment_id` (STRING)
- `variant` (STRING; 'control' or 'treatment')
- `assigned_at` (TIMESTAMP)
### `offer_exposure_events`
- `user_id` (STRING)
- `exposed_at` (TIMESTAMP)
- `campaign_id` (STRING)
### `subscription_events`
- `user_id` (STRING)
- `event_time` (TIMESTAMP)
- `event_type` (STRING; one of 'trial_start', 'paid_start', 'cancel')
### `app_sessions`
- `user_id` (STRING)
- `session_start` (TIMESTAMP)
## Metric definitions
1) **Signup rate (trial start within 7 days)**
- Denominator: users assigned to the experiment.
- Numerator: users with a `trial_start` event where `event_time` is in `[assigned_at, assigned_at + 7 days)`.
2) **D30 activity retention**
- Denominator: users who started a trial within 7 days of assignment (the numerator of metric 1).
- Numerator: those users with **≥ 1 app session** whose `session_start` is in `[assigned_at + 30 days, assigned_at + 37 days)`.
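Both metrics use half-open windows anchored at `assigned_at`. A small Python sketch (the helper and dates are hypothetical, purely to make the boundary behavior concrete) shows why the left endpoint is included and the right endpoint excluded:

```python
from datetime import datetime, timedelta

assigned_at = datetime(2024, 1, 1)  # hypothetical assignment time (UTC)

def in_window(ts, start_offset_days, end_offset_days):
    """Half-open window [assigned_at + start, assigned_at + end)."""
    return (assigned_at + timedelta(days=start_offset_days) <= ts
            < assigned_at + timedelta(days=end_offset_days))

# Signup window: [assigned_at, assigned_at + 7 days)
print(in_window(datetime(2024, 1, 7, 23, 59), 0, 7))  # True: just inside
print(in_window(datetime(2024, 1, 8), 0, 7))          # False: right endpoint excluded
# D30 retention window: [assigned_at + 30 days, assigned_at + 37 days)
print(in_window(datetime(2024, 1, 31), 30, 37))       # True: left endpoint included
```

Half-open windows ensure every event falls in exactly one bucket when windows tile the timeline, with no double counting at the boundaries.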
## Task
Write SQL to output one row per variant with:
- `variant`
- `assigned_users`
- `signup_users_7d`
- `signup_rate_7d`
- `retained_users_d30`
- `retention_rate_d30`
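One possible solution can be sketched end to end with SQLite and invented toy data. Table and column names mirror the schemas above; `offer_exposure_events` is omitted because neither metric reads it, and SQLite's `datetime(ts, '+7 days')` modifier stands in for whatever interval arithmetic a production warehouse dialect would use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Schemas mirror the upstream tables. Timestamps are stored as ISO-8601
# UTC strings, so lexicographic comparison matches time order.
cur.executescript("""
CREATE TABLE experiment_assignment (
  user_id TEXT PRIMARY KEY, experiment_id TEXT, variant TEXT, assigned_at TEXT);
CREATE TABLE subscription_events (user_id TEXT, event_time TEXT, event_type TEXT);
CREATE TABLE app_sessions (user_id TEXT, session_start TEXT);

-- Invented toy cohort: everyone assigned at 2024-01-01 00:00:00 UTC.
INSERT INTO experiment_assignment VALUES
  ('u1','exp1','control',  '2024-01-01 00:00:00'),
  ('u2','exp1','control',  '2024-01-01 00:00:00'),
  ('u3','exp1','treatment','2024-01-01 00:00:00'),
  ('u4','exp1','treatment','2024-01-01 00:00:00');
INSERT INTO subscription_events VALUES
  ('u1','2024-01-03 12:00:00','trial_start'),  -- day 2: inside 7-day window
  ('u2','2024-01-10 12:00:00','trial_start'),  -- day 9: outside
  ('u3','2024-01-02 08:00:00','trial_start');  -- day 1: inside
INSERT INTO app_sessions VALUES
  ('u1','2024-02-01 09:00:00'),  -- day 31: inside [+30d, +37d)
  ('u3','2024-02-10 09:00:00'),  -- day 40: outside
  ('u4','2024-01-05 09:00:00');  -- day 4: outside
""")

query = """
WITH signup AS (
  SELECT a.user_id, a.variant,
         MAX(CASE WHEN s.event_type = 'trial_start'
                   AND s.event_time >= a.assigned_at
                   AND s.event_time <  datetime(a.assigned_at, '+7 days')
              THEN 1 ELSE 0 END) AS signed_up_7d
  FROM experiment_assignment a
  LEFT JOIN subscription_events s USING (user_id)
  GROUP BY a.user_id, a.variant
),
retained AS (
  SELECT a.user_id,
         MAX(CASE WHEN p.session_start >= datetime(a.assigned_at, '+30 days')
                   AND p.session_start <  datetime(a.assigned_at, '+37 days')
              THEN 1 ELSE 0 END) AS retained_d30
  FROM experiment_assignment a
  LEFT JOIN app_sessions p USING (user_id)
  GROUP BY a.user_id
)
SELECT s.variant,
       COUNT(*)                                   AS assigned_users,
       SUM(s.signed_up_7d)                        AS signup_users_7d,
       1.0 * SUM(s.signed_up_7d) / COUNT(*)       AS signup_rate_7d,
       SUM(s.signed_up_7d * r.retained_d30)       AS retained_users_d30,
       1.0 * SUM(s.signed_up_7d * r.retained_d30)
           / NULLIF(SUM(s.signed_up_7d), 0)       AS retention_rate_d30
FROM signup s
JOIN retained r USING (user_id)
GROUP BY s.variant
ORDER BY s.variant;
"""

rows = cur.execute(query).fetchall()
for row in rows:
    print(row)
# ('control', 2, 1, 0.5, 1, 1.0)
# ('treatment', 2, 1, 0.5, 0, 0.0)
```

The key design choice is collapsing events to one 0/1 flag per user (`MAX(CASE ...)`) *before* aggregating by variant, so a user with multiple trial starts or sessions is counted once; `NULLIF` guards against division by zero when a variant has no signups.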
Also briefly describe what an “upstream” dataset/logging pipeline must contain to compute these metrics reliably (e.g., assignment logs, exposure logs, subscription lifecycle events, and consistent user identifiers across tables).
Quick Answer: This question tests the ability to turn event-level logs into cohort metrics, specifically signup rate and D30 retention. It evaluates SQL/Python proficiency, event-time windowing with half-open intervals, joins, aggregation, and an understanding of upstream logging semantics.