Analyze Document Collaboration CSVs
Company: Notion
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: medium
Interview Round: Technical Screen
You are given two CSV files.
`document_activity.csv`
- `page_id` STRING — document identifier
- `user_id` STRING — user who viewed or worked on the document
- `viewed_at` TIMESTAMP — event time in UTC
- `is_creator` BOOLEAN — `true` if the user created the document
- `collaborator_source` STRING NULL — for non-creators only, the channel by which the collaborator reached the document (for example: invite, shared link, search). This field is `NULL` for creators.
`users.csv`
- `user_id` STRING — user identifier
- `country` STRING — user's country
Assumptions:
- `document_activity.user_id` joins to `users.user_id`.
- Each row represents one user-page event.
- A document is considered collaborative if at least two distinct users appear on the same `page_id`.
- No date filtering is required; use the full dataset.
- Treat all timestamps as UTC.
Tasks:
1. **Distribution of collaborator acquisition channels**
- Consider only rows where `is_creator = false`.
- Compute the distribution of `collaborator_source`.
- Return: `collaborator_source`, `collaborator_count`, `collaborator_share`.
- Define `collaborator_share` as `collaborator_count / total_non_creator_rows`.
2. **Find who collaborated most with a given user**
- Given an input `user_id`, find the other user or users who collaborated most often with that user.
- Define collaboration as appearing on the same `page_id`.
- Count distinct shared `page_id` values, not raw events.
- Return: `input_user_id`, `other_user_id`, `shared_page_count`.
- If multiple users tie for the maximum, return all tied users ordered by `other_user_id`.
3. **Identify which country has the highest collaboration rate**
- Join to `users.csv` to get each user's country.
- For each country, define:
- `total_active_users` = number of distinct users from that country who appear in `document_activity`
- `collaborating_users` = number of those users who appear on at least one collaborative document
- `collaboration_rate` = `collaborating_users / total_active_users`
- Return: `country`, `collaborating_users`, `total_active_users`, `collaboration_rate`.
- Rank countries by `collaboration_rate` descending.
Quick Answer: This question evaluates data manipulation and analytical skills, including joining user and activity records, aggregating and deduplicating events, and computing collaboration metrics across CSV datasets.