Aggregate user logs into 30-minute sessions
Company: Robinhood
Role: Data Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Technical Screen
Quick Answer: This question evaluates data engineering competencies in time-series sessionization, datetime parsing and arithmetic, grouping and aggregation, distinct-count computation, and CSV log processing.
Constraints
- 0 <= number of log rows <= 200000
- Each row has the format `user_id,log_datetime,topic`
- Timestamps are valid and use the format `YYYY-MM-DD HH:MM:SS`
- A gap of exactly 30 minutes stays in the same session; only gaps greater than 30 minutes start a new session
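
The boundary rule in the last constraint can be sketched in Python as follows. The names `TS_FORMAT`, `SESSION_GAP`, and `starts_new_session` are illustrative assumptions, not part of the problem statement; the point is only that the comparison is strict (`>`), so a gap of exactly 30 minutes does not split a session.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"      # matches YYYY-MM-DD HH:MM:SS
SESSION_GAP = timedelta(minutes=30)  # hypothetical constant name


def starts_new_session(prev_ts: str, cur_ts: str) -> bool:
    """Return True only when the gap is strictly greater than 30 minutes."""
    prev = datetime.strptime(prev_ts, TS_FORMAT)
    cur = datetime.strptime(cur_ts, TS_FORMAT)
    return cur - prev > SESSION_GAP
```

With the timestamps from the second example below, a 30-minute gap (09:30:00 to 10:00:00) keeps the session open, while a 31-minute gap (10:00:00 to 10:31:00) closes it.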
Examples
Input: "user_id,log_datetime,topic\n001,2025-03-01 00:01:00,pricing\n001,2025-03-01 00:02:00,hotel\n001,2025-03-01 00:03:00,pricing\n001,2025-03-01 01:30:00,restaurant\n001,2025-03-01 02:30:00,restaurant\n"
Expected Output: [["001", "2025-03-01 00:01:00", "2025-03-01 00:03:00", 2, 3], ["001", "2025-03-01 01:30:00", "2025-03-01 02:00:00", 1, 1], ["001", "2025-03-01 02:30:00", "2025-03-01 03:00:00", 1, 1]]
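Explanation: Judging from the expected output, each row lists the user ID, session start, session end, number of distinct topics, and number of events. The first three rows are within 30 minutes of each other, so they form one session with 2 distinct topics and 3 events. The last two rows are 87 and 60 minutes after the preceding row, so each becomes a single-event session with an assumed 30-minute duration.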
Input: "user_id,log_datetime,topic\n002,2025-03-01 10:31:00,sports\n001,2025-03-01 09:30:00,a\n001,2025-03-01 10:00:00,b\n002,2025-03-01 10:00:00,news\n001,2025-03-01 10:31:00,a\n002,2025-03-01 10:30:00,news\n001,2025-03-01 09:00:00,a\n"
Expected Output: [["001", "2025-03-01 09:00:00", "2025-03-01 10:00:00", 2, 3], ["001", "2025-03-01 10:31:00", "2025-03-01 11:01:00", 1, 1], ["002", "2025-03-01 10:00:00", "2025-03-01 10:31:00", 2, 3]]
Explanation: The rows arrive unordered, so each user's events must be sorted by timestamp first. For user 001, the two 30-minute gaps (09:00 to 09:30 to 10:00) stay in the same session, but the 31-minute gap before 10:31 starts a new session. For user 002, the gaps are 30 minutes and 1 minute, so all three events belong to one session.
Input: "user_id,log_datetime,topic\n007,2025-07-04 12:00:00,travel\n"
Expected Output: [["007", "2025-07-04 12:00:00", "2025-07-04 12:30:00", 1, 1]]
Explanation: A single event forms a one-row session, so the session end is 30 minutes after the start.
Input: "user_id,log_datetime,topic\n"
Expected Output: []
Explanation: There are no data rows, so there are no sessions.
Hints
- Group rows by user first, then sort each user's events by timestamp before building sessions.
- When scanning a user's events, keep track of the current session start time, last event time, event count, and a set of distinct topics.
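
The two hints can be combined into a minimal Python sketch like the one below. It is one possible approach under stated assumptions, not a reference solution: the function names `sessionize` and `_close` are hypothetical, output rows are assumed to be ordered by `user_id` (compared as strings) and then by session start as in the examples, and a session with more than one event is assumed to end at its last event time even if all its timestamps are identical.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"
SESSION_GAP = timedelta(minutes=30)


def _close(user_id, start, last, topics, count):
    # A single-event session gets an assumed 30-minute duration;
    # otherwise the session ends at its last event.
    end = last if count > 1 else start + SESSION_GAP
    return [user_id, start.strftime(TS_FORMAT), end.strftime(TS_FORMAT),
            len(topics), count]


def sessionize(csv_text):
    # Group parsed events by user; DictReader consumes the header row.
    events_by_user = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        ts = datetime.strptime(row["log_datetime"], TS_FORMAT)
        events_by_user[row["user_id"]].append((ts, row["topic"]))

    result = []
    for user_id in sorted(events_by_user):
        events = sorted(events_by_user[user_id])  # order by timestamp
        start = last = events[0][0]
        count, topics = 0, set()
        for ts, topic in events:
            # Strictly greater than 30 minutes starts a new session.
            if ts - last > SESSION_GAP:
                result.append(_close(user_id, start, last, topics, count))
                start, count, topics = ts, 0, set()
            last = ts
            count += 1
            topics.add(topic)
        result.append(_close(user_id, start, last, topics, count))
    return result
```

Sorting dominates the cost, so this sketch runs in O(n log n) time for n log rows, which is comfortable for the stated limit of 200000 rows.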