Merge ad CSVs and compute CTR
Company: Capital One
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Take-home Project
Using SQL, clean and merge four CSVs and answer all parts exactly. Schema and sample rows (assume types: date is DATE, others INT/VARCHAR): platforms(platform_id, platform) +-------------+----------+ | platform_id | platform | | 10 | Web | | 20 | Mobile | +-------------+----------+ ads(ad_id, platform_id, video_id) +------+-------------+----------+ | ad_id| platform_id | video_id | | 1001 | 10 | 501 | | 1002 | 10 | 502 | | 1003 | 20 | 501 | +------+-------------+----------+ videos(video_id, title, duration_sec) +----------+--------------+--------------+ | video_id | title | duration_sec | | 501 | Summer Promo | 30 | | 502 | Winter Promo | 45 | +----------+--------------+--------------+ totals(date, ad_id, plays, clicks, watch_time_sec) +------------+------+-------+--------+------------------+ | date | ad_id| plays | clicks | watch_time_sec | | 2008-01-01 | 1001 | 150 | 18 | 3400 | | - | 1002 | 220 | 25 | 7100 | | - | 1003 | 90 | 5 | 1800 | | 2008-01-02 | 1001 | 130 | 17 | 3200 | | - | 1002 | 210 | 22 | 6800 | +------------+------+-------+--------+------------------+ Notes: In totals, a '-' in the date column means “same as the most recent non-dash date above in file order.” Assume file order is by appearance and stable. Tasks: (a) In SQL, forward-fill the date within totals so that each row has a valid DATE; do not use procedural code; assume you can reference row_number() over file order. (b) Produce a 7-day window starting 2008-01-01 (inclusive) and ending 2008-01-07 (inclusive). (c) Compute, for that window, per ad_id and per platform, total_plays, total_clicks, total_watch_time_sec, and CTR = CASE WHEN total_plays>0 THEN total_clicks*1.0/total_plays ELSE NULL END. (d) Return: (1) the overall top 3 ads by total_plays in the window (tie-break by higher CTR, then lower ad_id), with platform name and video title; and (2) for every ad present in the window, the date within the window on which its watch_time_sec is maximal (break ties by earliest date). (e) Ensure all joins are correct and no row duplication occurs (explain the join keys you used). Provide a single SQL query (CTEs allowed) that outputs both result sets, clearly labeled.
Quick Answer: This question evaluates SQL data manipulation competencies such as cleaning and merging CSVs, forward-filling missing dates, correct multi-table joins without duplication, use of window functions, time-windowed aggregations, and computation of click-through rate (CTR).