How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

What difficulty level is this interview question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Notion.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Notion during technical interviews.

Analyze document collaboration patterns | Notion Interview Question

Analyze document collaboration patterns

Company: Notion

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: medium

Interview Round: Technical Screen

You are given two CSV files and asked to analyze collaboration behavior on documents. ### File 1: `document_activity.csv` Each row is a document view event. Schema: - `page_id` STRING — unique document ID - `user_id` STRING — user who viewed the document - `viewed_at` TIMESTAMP — event timestamp in UTC - `is_creator` BOOLEAN — `true` if the user is the document creator, `false` otherwise - `collaborator_source` STRING NULL — for non-creators, how the user arrived at the document (for example: invite link, email invite, search, direct). This field may be `NULL` for creators. ### File 2: `users.csv` Schema: - `user_id` STRING — unique user ID - `country` STRING — user country ### Assumptions - Join key: `document_activity.user_id = users.user_id`. - Use the full dataset; do not filter by date. - A user may view the same document multiple times, so for collaboration questions you should deduplicate to distinct `(page_id, user_id)` pairs unless a task explicitly asks for event counts. - A document may have one creator and multiple collaborators. ### Tasks 1. **Compute the distribution of collaborator acquisition sources.** - Consider only non-creators (`is_creator = false`). - Define the metric on distinct collaborator-document pairs `(page_id, user_id)`. - Required output columns: - `collaborator_source` - `collaborator_count` - `collaborator_share` 2. **Given an input `user_id`, find which other users collaborated with that user the most.** - Two users are considered to have collaborated if they both appeared on the same `page_id` at least once. - Count distinct shared documents, not view events. - Exclude the target user themself. - Required output columns: - `target_user_id` - `co_user_id` - `shared_page_count` - Return the user or users with the maximum `shared_page_count`. 3. **Find which country's users are most likely to collaborate with others.** - To make the metric precise, define a user as having "collaborated with others" if they appear on at least one document with at least one other distinct user (that is, the document has 2 or more distinct users total). - For each country, compute: - `collaborating_users` = number of distinct users in that country who collaborated with at least one other user - `total_users` = number of distinct users in that country from `users.csv` - `collaboration_rate = collaborating_users / total_users` - Required output columns: - `country` - `collaborating_users` - `total_users` - `collaboration_rate` - Rank countries from highest to lowest collaboration rate.

Quick Answer: This question evaluates a candidate's competency in data manipulation and analytics—specifically joins, deduplication of user-document pairs, aggregation, and computation of user- and country-level collaboration metrics using SQL or Python.

You are given two CSV files and asked to analyze collaboration behavior on documents.

File 1: `document_activity.csv`

Each row is a document view event.

Schema:

page_id STRING — unique document ID
user_id STRING — user who viewed the document
viewed_at TIMESTAMP — event timestamp in UTC
is_creator BOOLEAN — true if the user is the document creator, false otherwise
collaborator_source STRING NULL — for non-creators, how the user arrived at the document (for example: invite link, email invite, search, direct). This field may be NULL for creators.

File 2: `users.csv`

Schema:

user_id STRING — unique user ID
country STRING — user country

Assumptions

Join key: document_activity.user_id = users.user_id .
Use the full dataset; do not filter by date.
A user may view the same document multiple times, so for collaboration questions you should deduplicate to distinct (page_id, user_id) pairs unless a task explicitly asks for event counts.
A document may have one creator and multiple collaborators.

Tasks

Compute the distribution of collaborator acquisition sources.
- Consider only non-creators ( is_creator = false ).
- Define the metric on distinct collaborator-document pairs (page_id, user_id) .
- Required output columns:
  - collaborator_source
  - collaborator_count
  - collaborator_share
Given an input user_id, find which other users collaborated with that user the most.
- Two users are considered to have collaborated if they both appeared on the same page_id at least once.
- Count distinct shared documents, not view events.
- Exclude the target user themself.
- Required output columns:
  - target_user_id
  - co_user_id
  - shared_page_count
- Return the user or users with the maximum shared_page_count .
Find which country's users are most likely to collaborate with others.
- To make the metric precise, define a user as having "collaborated with others" if they appear on at least one document with at least one other distinct user (that is, the document has 2 or more distinct users total).
- For each country, compute:
  - collaborating_users = number of distinct users in that country who collaborated with at least one other user
  - total_users = number of distinct users in that country from users.csv
  - collaboration_rate = collaborating_users / total_users
- Required output columns:
  - country
  - collaborating_users
  - total_users
  - collaboration_rate
- Rank countries from highest to lowest collaboration rate.

Analyze document collaboration patterns

Company: Notion

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: medium

Interview Round: Technical Screen

You are given two CSV files and asked to analyze collaboration behavior on documents.

File 1: `document_activity.csv`

Each row is a document view event.

Schema:

page_id STRING — unique document ID
user_id STRING — user who viewed the document
viewed_at TIMESTAMP — event timestamp in UTC
is_creator BOOLEAN — true if the user is the document creator, false otherwise
collaborator_source STRING NULL — for non-creators, how the user arrived at the document (for example: invite link, email invite, search, direct). This field may be NULL for creators.

File 2: `users.csv`

Schema:

user_id STRING — unique user ID
country STRING — user country

Assumptions

Join key: document_activity.user_id = users.user_id .
Use the full dataset; do not filter by date.
A user may view the same document multiple times, so for collaboration questions you should deduplicate to distinct (page_id, user_id) pairs unless a task explicitly asks for event counts.
A document may have one creator and multiple collaborators.

Tasks

Compute the distribution of collaborator acquisition sources.
- Consider only non-creators ( is_creator = false ).
- Define the metric on distinct collaborator-document pairs (page_id, user_id) .
- Required output columns:
  - collaborator_source
  - collaborator_count
  - collaborator_share
Given an input user_id, find which other users collaborated with that user the most.
- Two users are considered to have collaborated if they both appeared on the same page_id at least once.
- Count distinct shared documents, not view events.
- Exclude the target user themself.
- Required output columns:
  - target_user_id
  - co_user_id
  - shared_page_count
- Return the user or users with the maximum shared_page_count .
Find which country's users are most likely to collaborate with others.
- To make the metric precise, define a user as having "collaborated with others" if they appear on at least one document with at least one other distinct user (that is, the document has 2 or more distinct users total).
- For each country, compute:
  - collaborating_users = number of distinct users in that country who collaborated with at least one other user
  - total_users = number of distinct users in that country from users.csv
  - collaboration_rate = collaborating_users / total_users
- Required output columns:
  - country
  - collaborating_users
  - total_users
  - collaboration_rate
- Rank countries from highest to lowest collaboration rate.

Analyze document collaboration patterns

Quick Overview

File 1: `document_activity.csv`

File 2: `users.csv`

Assumptions

Tasks

Comments (0)

Analyze document collaboration patterns

Quick Overview

File 1: `document_activity.csv`

File 2: `users.csv`

Assumptions

Tasks

Comments (0)

Analyze document collaboration patterns

Quick Overview