Write SQL/Python for CTR analytics
Company: Uber
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Part A — SQL (use the schema and sample data below): Compute 48-hour unique CTR for campaign_id=100 by variant, deduplicating to the earliest send per (user_id, campaign_id) on/after 2025-08-30, and excluding internal accounts user_id <= 2. Define CTR as distinct users with ≥1 click in [send_time, send_time + 48h] divided by distinct users sent (after deduplication). Return columns: campaign_id, variant, sends, unique_clickers_48h, ctr_48h. Then, also return a single row with the absolute lift (test_ctr − control_ctr). Optionally, include counts needed to compute a 95% CI for the lift using the normal approximation so it can be calculated downstream.
Schema:
users(user_id INT, signup_dt DATE, locale STRING)
email_sends(send_id INT, user_id INT, campaign_id INT, send_time TIMESTAMP, variant STRING) -- variant in ('control','test')
email_events(event_id INT, send_id INT, event_type STRING, event_time TIMESTAMP) -- event_type in ('open','click','unsubscribe')
Sample data:
users
+---------+------------+--------+
| user_id | signup_dt | locale |
+---------+------------+--------+
| 1 | 2025-08-20 | US |
| 2 | 2025-08-22 | US |
| 3 | 2025-08-25 | CA |
| 4 | 2025-08-27 | US |
| 5 | 2025-08-28 | GB |
email_sends
+---------+---------+------------+---------------------+----------+
| send_id | user_id | campaign_id| send_time | variant |
+---------+---------+------------+---------------------+----------+
| 10 | 1 | 100 | 2025-08-30 09:00:00 | control |
| 11 | 2 | 100 | 2025-08-30 09:00:00 | control |
| 12 | 3 | 100 | 2025-08-30 09:00:00 | test |
| 13 | 1 | 100 | 2025-08-31 09:00:00 | test | -- resend
| 14 | 4 | 101 | 2025-08-31 10:00:00 | control |
| 15 | 5 | 100 | 2025-08-30 09:00:00 | test |
email_events
+----------+---------+------------+---------------------+
| event_id | send_id | event_type | event_time |
+----------+---------+------------+---------------------+
| 1000 | 10 | open | 2025-08-30 09:05:00 |
| 1001 | 10 | click | 2025-08-30 09:06:00 |
| 1002 | 11 | open | 2025-08-30 09:10:00 |
| 1003 | 12 | open | 2025-08-30 09:07:00 |
| 1004 | 12 | unsubscribe| 2025-08-30 10:00:00 |
| 1005 | 13 | click | 2025-08-31 09:12:00 |
| 1006 | 15 | click | 2025-08-31 09:00:00 |
| 1007 | 14 | click | 2025-08-31 10:05:00 |
Part B — Python (pandas/sklearn): Given a DataFrame df with columns y_true (0/1), y_prob (predicted probability), and weight (non-negative), write code to: (1) plot the precision-recall curve; (2) find the threshold that maximizes weighted F1 using 'weight' as sample weights; (3) compute precision@top1% of the population by scoring, breaking ties deterministically. Then, explain why a model can have ROC-AUC=0.86 but PR-AUC=0.18 at 1% prevalence, and which metric you would optimize for a marketing CTR use case.
Quick Answer: This question evaluates proficiency in data manipulation and analytics, covering SQL windowing and deduplication for time‑windowed cohort metrics (48‑hour unique CTR) as well as practical Python model evaluation tasks such as precision‑recall plotting, weighted F1 threshold selection, and precision@top‑k.