Part A — SQL (use the schema and sample data below): Compute 48-hour unique CTR for campaign_id=100 by variant, deduplicating to the earliest send per (user_id, campaign_id) on/after 2025-08-30, and excluding internal accounts user_id <= 2. Define CTR as distinct users with ≥1 click in [send_time, send_time + 48h] divided by distinct users sent (after deduplication). Return columns: campaign_id, variant, sends, unique_clickers_48h, ctr_48h. Then, also return a single row with the absolute lift (test_ctr − control_ctr). Optionally, include counts needed to compute a 95% CI for the lift using the normal approximation so it can be calculated downstream.
Schema: users(user_id INT, signup_dt DATE, locale STRING) email_sends(send_id INT, user_id INT, campaign_id INT, send_time TIMESTAMP, variant STRING) -- variant in ('control','test') email_events(event_id INT, send_id INT, event_type STRING, event_time TIMESTAMP) -- event_type in ('open','click','unsubscribe')
Sample data: users +---------+------------+--------+ | user_id | signup_dt | locale | +---------+------------+--------+ | 1 | 2025-08-20 | US | | 2 | 2025-08-22 | US | | 3 | 2025-08-25 | CA | | 4 | 2025-08-27 | US | | 5 | 2025-08-28 | GB |
email_sends +---------+---------+------------+---------------------+----------+ | send_id | user_id | campaign_id| send_time | variant | +---------+---------+------------+---------------------+----------+ | 10 | 1 | 100 | 2025-08-30 09:00:00 | control | | 11 | 2 | 100 | 2025-08-30 09:00:00 | control | | 12 | 3 | 100 | 2025-08-30 09:00:00 | test | | 13 | 1 | 100 | 2025-08-31 09:00:00 | test | -- resend | 14 | 4 | 101 | 2025-08-31 10:00:00 | control | | 15 | 5 | 100 | 2025-08-30 09:00:00 | test |
email_events +----------+---------+------------+---------------------+ | event_id | send_id | event_type | event_time | +----------+---------+------------+---------------------+ | 1000 | 10 | open | 2025-08-30 09:05:00 | | 1001 | 10 | click | 2025-08-30 09:06:00 | | 1002 | 11 | open | 2025-08-30 09:10:00 | | 1003 | 12 | open | 2025-08-30 09:07:00 | | 1004 | 12 | unsubscribe| 2025-08-30 10:00:00 | | 1005 | 13 | click | 2025-08-31 09:12:00 | | 1006 | 15 | click | 2025-08-31 09:00:00 | | 1007 | 14 | click | 2025-08-31 10:05:00 |
Part B — Python (pandas/sklearn): Given a DataFrame df with columns y_true (0/1), y_prob (predicted probability), and weight (non-negative), write code to: (1) plot the precision-recall curve; (2) find the threshold that maximizes weighted F1 using 'weight' as sample weights; (3) compute precision@top1% of the population by scoring, breaking ties deterministically. Then, explain why a model can have ROC-AUC=0.86 but PR-AUC=0.18 at 1% prevalence, and which metric you would optimize for a marketing CTR use case.