Label new vs old users over time in SQL
Company: Meta
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Define users as “new” during the first 30 days inclusive after their signup_date, and “old” thereafter. Produce per-user, per-day labels over a window, acknowledging that a single user naturally has different labels on different days. Use today = 2025-09-01.
Schema:
users(user_id INT, signup_date DATE)
activity(user_id INT, activity_date DATE) -- any daily activity counts as presence
Sample data:
users
+---------+-------------+
| user_id | signup_date |
+---------+-------------+
| 1 | 2025-07-25 |
| 2 | 2025-08-20 |
| 3 | 2025-08-01 |
| 4 | 2025-09-01 |
+---------+-------------+
activity
+---------+---------------+
| user_id | activity_date |
+---------+---------------+
| 1 | 2025-08-02 |
| 1 | 2025-08-26 |
| 2 | 2025-08-25 |
| 2 | 2025-09-01 |
| 3 | 2025-08-15 |
| 3 | 2025-09-01 |
| 4 | 2025-09-01 |
+---------+---------------+
Tasks:
A) Write SQL that outputs (user_id, d, label) for every day d in [2025-08-02, 2025-09-01], labeling “new” if d between signup_date and signup_date + 29 days inclusive, else “old”. Avoid assigning multiple labels for the same user-day. Prefer generating dates only where activity exists (join to activity) to reduce scan cost.
B) Using your labels, compute two aggregates for the last 30 days relative to today (window = [2025-08-03, 2025-09-01]): (i) active_new_users and active_old_users per day; (ii) a single summary row with total distinct users by label over the window. Explain any assumptions about time zones and inclusive/exclusive bounds.
Quick Answer: This question evaluates a candidate's ability to perform time-based user labeling and aggregation in SQL, testing skills such as date arithmetic, joins, deduplication of user-day records, and efficient generation of per-user-per-day labels.