PracHub
QuestionsPremiumLearningGuidesCheatsheetNEWCoaches
|Home/Data Manipulation (SQL/Python)/Notion

Analyze document collaboration patterns

Last updated: Apr 10, 2026

Quick Overview

This question evaluates a candidate's competency in data manipulation and analytics—specifically joins, deduplication of user-document pairs, aggregation, and computation of user- and country-level collaboration metrics using SQL or Python.

  • medium
  • Notion
  • Data Manipulation (SQL/Python)
  • Data Scientist

Analyze document collaboration patterns

Company: Notion

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: medium

Interview Round: Technical Screen

You are given two CSV files and asked to analyze collaboration behavior on documents. ### File 1: `document_activity.csv` Each row is a document view event. Schema: - `page_id` STRING — unique document ID - `user_id` STRING — user who viewed the document - `viewed_at` TIMESTAMP — event timestamp in UTC - `is_creator` BOOLEAN — `true` if the user is the document creator, `false` otherwise - `collaborator_source` STRING NULL — for non-creators, how the user arrived at the document (for example: invite link, email invite, search, direct). This field may be `NULL` for creators. ### File 2: `users.csv` Schema: - `user_id` STRING — unique user ID - `country` STRING — user country ### Assumptions - Join key: `document_activity.user_id = users.user_id`. - Use the full dataset; do not filter by date. - A user may view the same document multiple times, so for collaboration questions you should deduplicate to distinct `(page_id, user_id)` pairs unless a task explicitly asks for event counts. - A document may have one creator and multiple collaborators. ### Tasks 1. **Compute the distribution of collaborator acquisition sources.** - Consider only non-creators (`is_creator = false`). - Define the metric on distinct collaborator-document pairs `(page_id, user_id)`. - Required output columns: - `collaborator_source` - `collaborator_count` - `collaborator_share` 2. **Given an input `user_id`, find which other users collaborated with that user the most.** - Two users are considered to have collaborated if they both appeared on the same `page_id` at least once. - Count distinct shared documents, not view events. - Exclude the target user themself. - Required output columns: - `target_user_id` - `co_user_id` - `shared_page_count` - Return the user or users with the maximum `shared_page_count`. 3. **Find which country's users are most likely to collaborate with others.** - To make the metric precise, define a user as having "collaborated with others" if they appear on at least one document with at least one other distinct user (that is, the document has 2 or more distinct users total). - For each country, compute: - `collaborating_users` = number of distinct users in that country who collaborated with at least one other user - `total_users` = number of distinct users in that country from `users.csv` - `collaboration_rate = collaborating_users / total_users` - Required output columns: - `country` - `collaborating_users` - `total_users` - `collaboration_rate` - Rank countries from highest to lowest collaboration rate.

Quick Answer: This question evaluates a candidate's competency in data manipulation and analytics—specifically joins, deduplication of user-document pairs, aggregation, and computation of user- and country-level collaboration metrics using SQL or Python.

Related Interview Questions

  • Analyze document collaboration from CSV logs - Notion (easy)
  • Analyze Document Collaboration CSVs - Notion (medium)
Notion logo
Notion
Aug 10, 2025, 12:00 AM
Data Scientist
Technical Screen
Data Manipulation (SQL/Python)
7
0

You are given two CSV files and asked to analyze collaboration behavior on documents.

File 1: document_activity.csv

Each row is a document view event.

Schema:

  • page_id STRING — unique document ID
  • user_id STRING — user who viewed the document
  • viewed_at TIMESTAMP — event timestamp in UTC
  • is_creator BOOLEAN — true if the user is the document creator, false otherwise
  • collaborator_source STRING NULL — for non-creators, how the user arrived at the document (for example: invite link, email invite, search, direct). This field may be NULL for creators.

File 2: users.csv

Schema:

  • user_id STRING — unique user ID
  • country STRING — user country

Assumptions

  • Join key: document_activity.user_id = users.user_id .
  • Use the full dataset; do not filter by date.
  • A user may view the same document multiple times, so for collaboration questions you should deduplicate to distinct (page_id, user_id) pairs unless a task explicitly asks for event counts.
  • A document may have one creator and multiple collaborators.

Tasks

  1. Compute the distribution of collaborator acquisition sources.
    • Consider only non-creators ( is_creator = false ).
    • Define the metric on distinct collaborator-document pairs (page_id, user_id) .
    • Required output columns:
      • collaborator_source
      • collaborator_count
      • collaborator_share
  2. Given an input user_id, find which other users collaborated with that user the most.
    • Two users are considered to have collaborated if they both appeared on the same page_id at least once.
    • Count distinct shared documents, not view events.
    • Exclude the target user themself.
    • Required output columns:
      • target_user_id
      • co_user_id
      • shared_page_count
    • Return the user or users with the maximum shared_page_count .
  3. Find which country's users are most likely to collaborate with others.
    • To make the metric precise, define a user as having "collaborated with others" if they appear on at least one document with at least one other distinct user (that is, the document has 2 or more distinct users total).
    • For each country, compute:
      • collaborating_users = number of distinct users in that country who collaborated with at least one other user
      • total_users = number of distinct users in that country from users.csv
      • collaboration_rate = collaborating_users / total_users
    • Required output columns:
      • country
      • collaborating_users
      • total_users
      • collaboration_rate
    • Rank countries from highest to lowest collaboration rate.

Comments (0)

Sign in to leave a comment

Loading comments...

Browse More Questions

More Data Manipulation (SQL/Python)•More Notion•More Data Scientist•Notion Data Scientist•Notion Data Manipulation (SQL/Python)•Data Scientist Data Manipulation (SQL/Python)
PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.