Solve SQL CTR and Python analytics tasks
Company: Upstart
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: HR Screen
Part A — SQL: Compute click-through rate (CTR) by pin_format for US new users. New users are those whose signup_date is within 30 days (inclusive) of the action event_date.
Schema (invented):
- events(event_date DATE, user_id INT, pin_id INT, event_type STRING CHECK IN ('impression','click'), event_count INT)
- users(user_id INT, country_code STRING, signup_date DATE)
- pin_classification(pin_id INT, pin_format STRING CHECK IN ('video','static','story','other'))
Notes/requirements:
- Join events→users on user_id and events→pin_classification on pin_id.
- Filter users.country_code = 'US'.
- Define new user as 0 ≤ DATEDIFF(event_date, signup_date) ≤ 30 (inclusive).
- Aggregate impressions = SUM(event_count WHERE event_type='impression'); clicks = SUM(event_count WHERE event_type='click').
- CTR = clicks / impressions, computed after aggregation at the pin_format level; exclude pin_formats whose aggregated impressions = 0 to avoid division by zero.
- Return one row per pin_format with columns: pin_format, impressions, clicks, ctr (rounded to 4 decimals), and the distinct_new_users contributing to those events.
- Ensure no double counting if multiple rows per (event_date, user_id, pin_id, event_type).
ASCII sample tables:
users
+---------+--------------+-------------+
| user_id | country_code | signup_date |
+---------+--------------+-------------+
| 101 | US | 2025-07-20 |
| 102 | US | 2025-08-05 |
| 103 | CA | 2025-08-01 |
| 104 | US | 2025-06-10 |
+---------+--------------+-------------+
pin_classification
+--------+------------+
| pin_id | pin_format |
+--------+------------+
| 10 | video |
| 11 | static |
| 12 | video |
| 13 | story |
+--------+------------+
events
+------------+---------+--------+------------+-------------+
| event_date | user_id | pin_id | event_type | event_count |
+------------+---------+--------+------------+-------------+
| 2025-08-01 | 101 | 10 | impression | 5 |
| 2025-08-01 | 101 | 10 | click | 1 |
| 2025-08-10 | 102 | 11 | impression | 3 |
| 2025-08-10 | 102 | 11 | click | 0 |
| 2025-08-15 | 103 | 12 | impression | 4 |
| 2025-08-15 | 103 | 12 | click | 2 |
| 2025-08-20 | 104 | 13 | impression | 7 |
| 2025-08-20 | 104 | 13 | click | 1 |
+------------+---------+--------+------------+-------------+
Question: Write a single SQL query that produces the required output, handling all constraints above.
Part B — Python (Pandas + core Python):
1) Given a DataFrame df with columns [pin_id:int, category_id:int, time_spent_seconds:float, pin_format:str] and a dict category_map = {category_id:int -> category_name:str}, return the category_name with the highest average time_spent_seconds among rows where pin_format case-insensitively equals 'video'. Break ties by lexicographically smallest category_name. Exclude null/NaN time_spent_seconds. Aim for vectorized Pandas (no Python loops over rows).
2) Given a dict user_to_pins = {'user1':[1,2,2], 'user2':[1,2,3], ...}, compute the average number of unique pin_ids per user (i.e., mean over users of |set(pin_ids)|), returning a float. Handle empty lists, extremely large lists, and memory efficiency for millions of users. Provide time/space complexity and code.
Quick Answer: This question evaluates SQL and Python data-manipulation competencies, covering date-based user filtering, joins and deduplication across events, users and pin metadata, aggregated CTR calculation with division-by-zero handling, and vectorized Pandas grouping, NaN handling, case-insensitive filtering, and tie-breaking.