Analyze document collaboration patterns
Company: Notion
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: medium
Interview Round: Technical Screen
You are given two CSV files and asked to analyze collaboration behavior on documents.
### File 1: `document_activity.csv`
Each row is a document view event.
Schema:
- `page_id` STRING — unique document ID
- `user_id` STRING — user who viewed the document
- `viewed_at` TIMESTAMP — event timestamp in UTC
- `is_creator` BOOLEAN — `true` if the user is the document creator, `false` otherwise
- `collaborator_source` STRING NULL — for non-creators, how the user arrived at the document (for example: invite link, email invite, search, direct). This field may be `NULL` for creators.
### File 2: `users.csv`
Schema:
- `user_id` STRING — unique user ID
- `country` STRING — user country
### Assumptions
- Join key: `document_activity.user_id = users.user_id`.
- Use the full dataset; do not filter by date.
- A user may view the same document multiple times, so for collaboration questions you should deduplicate to distinct `(page_id, user_id)` pairs unless a task explicitly asks for event counts.
- A document may have one creator and multiple collaborators.
### Tasks
1. **Compute the distribution of collaborator acquisition sources.**
- Consider only non-creators (`is_creator = false`).
- Define the metric on distinct collaborator-document pairs `(page_id, user_id)`.
- Required output columns:
- `collaborator_source`
- `collaborator_count`
- `collaborator_share`
2. **Given an input `user_id`, find which other users collaborated with that user the most.**
- Two users are considered to have collaborated if they both appeared on the same `page_id` at least once.
- Count distinct shared documents, not view events.
- Exclude the target user themself.
- Required output columns:
- `target_user_id`
- `co_user_id`
- `shared_page_count`
- Return the user or users with the maximum `shared_page_count`.
3. **Find which country's users are most likely to collaborate with others.**
- To make the metric precise, define a user as having "collaborated with others" if they appear on at least one document with at least one other distinct user (that is, the document has 2 or more distinct users total).
- For each country, compute:
- `collaborating_users` = number of distinct users in that country who collaborated with at least one other user
- `total_users` = number of distinct users in that country from `users.csv`
- `collaboration_rate = collaborating_users / total_users`
- Required output columns:
- `country`
- `collaborating_users`
- `total_users`
- `collaboration_rate`
- Rank countries from highest to lowest collaboration rate.
Quick Answer: This question evaluates a candidate's competency in data manipulation and analytics—specifically joins, deduplication of user-document pairs, aggregation, and computation of user- and country-level collaboration metrics using SQL or Python.