Write SQL for visibility, calls, and cohort activity
Company: Meta
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: HR Screen
You have the following schema and toy data. Assume "today" = 2025-09-01.
users(user_id INT, signup_date DATE)
Sample:
user_id | signup_date
--------+------------
1 | 2025-08-31
2 | 2025-08-15
3 | 2025-06-01
4 | 2025-09-01
shops(shop_id INT, category TEXT)
Sample:
shop_id | category
--------+---------
10 | Bakery
11 | Grocery
12 | Pharmacy
visibility_events(event_time TIMESTAMP, user_id INT, shop_id INT, position INT, visible BOOLEAN, dwell_seconds INT)
Sample:
event_time | user_id | shop_id | position | visible | dwell_seconds
---------------------+---------+---------+----------+---------+--------------
2025-09-01 09:00:00 | 1 | 10 | 1 | true | 6
2025-09-01 09:00:05 | 1 | 11 | 4 | true | 2
2025-09-01 09:10:00 | 2 | 10 | 2 | true | 8
2025-09-01 10:00:00 | 3 | 12 | 1 | true | 12
2025-09-01 10:05:00 | 3 | 10 | 5 | false | 0
2025-09-01 11:00:00 | 4 | 11 | 2 | true | 7
calls(started_at TIMESTAMP, caller_id INT, receiver_id INT)
Sample:
started_at | caller_id | receiver_id
---------------------+-----------+------------
2025-09-01 08:00:00 | 1 | 2
2025-09-01 09:30:00 | 3 | 1
2025-08-31 22:00:00 | 2 | 4
Tasks (write SQL; be careful about edge cases like duplicates, invisible rows, and users with zero activity today):
A) For 2025-09-01, return shop_id, unique_viewers, avg_dwell_seconds among visible=true rows, for shops with unique_viewers >= 2, ordered by unique_viewers DESC then shop_id ASC. Use GROUP BY, HAVING, ORDER BY.
B) For 2025-09-01, join shops to visibility_events and compute, by category, the top_of_feed_visibility_rate = visible events with position <= 3 divided by all visible events. Return category and the rate rounded to 3 decimals, sorted DESC. Use a CASE and a JOIN.
C) For 2025-09-01, for each user_id, compute seconds_to_next_visible_event using a window function over that user's visible=true events ordered by event_time. Return user_id, event_time, seconds_to_next_visible_event (NULL for the last visible event in the day).
D) You’re told to compare new vs. old user engagement using only today’s snapshot (2025-09-01) and to “bucket by duration.” Define user_age_days = DATEDIFF(day, signup_date, DATE '2025-09-01'). Create buckets: [0,7), [7,30), [30,180), [180,INF). Define active_today as having at least one visible=true event on 2025-09-01 with dwell_seconds >= 5. Using only today’s data, compute per bucket: dau_today (distinct users with any visible=true event), active_users_today (distinct users with active_today), and active_rate_today = active_users_today / dau_today. Return bucket_label, dau_today, active_users_today, active_rate_today, sorted by bucket order. Discuss one bias introduced by restricting the denominator to users observed today.
E) On 2025-09-01, count distinct users who participated in at least one call (as caller or receiver). Provide two queries: one using UNION and one using UNION ALL, and explain precisely when they produce different counts on the sample data. Then state the general rule for when UNION ALL is safe here and when it will double count.
F) Bonus: Using a window function, for each shop_id on 2025-09-01 compute a 3-event moving average of dwell_seconds over time (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) and return shop_id, event_time, ma3_dwell_seconds.
Quick Answer: This question evaluates a candidate's proficiency in SQL-based data manipulation, covering aggregations, joins, window functions, date arithmetic, deduplication, cohorting, and calculation of engagement metrics.