This question evaluates data engineering competencies in time-series sessionization, datetime parsing and arithmetic, grouping and aggregation, distinct-count computation, and CSV log processing.
You are given a CSV file with columns:
user_id | log_datetime | topic
Example input rows (already in time order for the same user):
001 | 2025-03-01 00:01:00 | pricing
001 | 2025-03-01 00:02:00 | hotel
001 | 2025-03-01 00:03:00 | pricing
001 | 2025-03-01 01:30:00 | restaurant
001 | 2025-03-01 02:30:00 | restaurant
Task (Python):
log_datetime
and split them into sessions: a new session starts when the gap from the previous event is
more than 30 minutes
.
user_id
session_start
(timestamp of first event)
session_end
session_end = session_start + 30 minutes
(treating 30 minutes as the maximum assumed duration)
count_topic
: number of
distinct
topics in the session
count_time
: number of events (rows) in the session
Expected output for the example:
001 | 2025-03-01 00:01:00 | 2025-03-01 00:03:00 | 2 | 3
001 | 2025-03-01 01:30:00 | 2025-03-01 02:00:00 | 1 | 1
001 | 2025-03-01 02:30:00 | 2025-03-01 03:00:00 | 1 | 1
Implement a function/program that reads the CSV and produces these session aggregates.