Analyze document collaboration from CSV logs
Company: Notion
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: easy
Interview Round: Technical Screen
You are given two CSV files exported from a document collaboration product.
## Data
### 1) `doc_access.csv`
Each row represents a user accessing (viewing/editing) a document page.
| column | type | description |
|---|---|---|
| `page_id` | STRING/INT | Document/page identifier |
| `user_id` | STRING/INT | User who accessed the document |
| `event_ts` | TIMESTAMP | When the access happened (assume UTC) |
| `is_creator` | BOOLEAN | `TRUE` if this user is the creator/owner of the document; otherwise the user is a collaborator |
| `collab_source` | STRING | For collaborators only (`is_creator=FALSE`): how the user became a collaborator (e.g., invited, link share, org default, etc.) |
Notes/assumptions:
- A “collaborator” is any record with `is_creator = FALSE`.
- A user may access the same `page_id` multiple times; unless otherwise specified, treat collaboration at the **document level** (i.e., dedupe by `(page_id, user_id)` where appropriate).
### 2) `users.csv`
| column | type | description |
|---|---|---|
| `user_id` | STRING/INT | User identifier |
| `country` | STRING | User’s country |
## Tasks
### Q1) Distribution of collaborator sources
Compute the distribution of how collaborators got onto documents.
**Output:**
- `collab_source`
- `num_collaborators` (count of collaborator users; specify whether you count unique users overall or unique `(page_id, user_id)` pairs)
- `pct_of_collaborators`
### Q2) For a given user, who did they collaborate with most?
Given an input parameter `target_user_id`, find the other user they collaborated with the most.
Define “collaborated with” as: the two users both appear on the same `page_id` (after deduping to unique `(page_id, user_id)` memberships).
**Output:**
- `target_user_id`
- `top_collaborator_user_id`
- `num_shared_pages` (number of distinct `page_id` where both users appear)
Also describe how you would break ties (e.g., smallest user_id, most recent shared collaboration, etc.).
### Q3) Which country collaborates the most?
Determine which country’s users are most likely to collaborate with others.
Because “collaboration rate” can be defined multiple ways, do the following:
1. Propose a clear numerator and denominator for a per-country collaboration rate.
- Example definition (acceptable):
- Numerator: # of users in the country who have collaborated with at least one other distinct user on at least one page.
- Denominator: total # of users in the country.
2. Compute the collaboration rate by country and return the top country.
**Output:**
- `country`
- `collaboration_rate`
- any intermediate counts you used (e.g., `num_users_in_country`, `num_users_who_collaborated`).
Quick Answer: This question evaluates data manipulation and analytical competencies—particularly deduplication, joins, aggregation, co-occurrence counting, and metric definition—applied to CSV access logs and user metadata.