Analyze Document Collaboration Patterns

This question evaluates a data scientist's skills in data manipulation, join operations, deduplication, aggregation, timestamp handling, and metric definition within the Data Manipulation (SQL/Python) domain.

Question

You are given two CSV files and asked to analyze document collaboration behavior using SQL or Python (for example, pandas).

File 1: `page_events.csv`

Each row is a page-view event.

- page_id STRING: unique document/page identifier
- user_id STRING: user who viewed the document
- viewed_at TIMESTAMP: event time of the view
- is_creator BOOLEAN: true if this user is the creator/owner of the document, false otherwise
- collaborator_source STRING: for non-creators, the channel through which the user first reached the document (for example: invite_email, share_link, search, comment_mention); may be NULL for creators

File 2: `users.csv`

- user_id STRING: unique user identifier
- country STRING: user's country

Assumptions

- users.user_id joins to page_events.user_id.
- All timestamps are in UTC.
- Use the full CSV contents as the observation window.
- A user is considered a collaborator on a document if they appear on that page_id with is_creator = false at least once.
- If a user views the same document multiple times, treat that as one (page_id, user_id) collaboration when computing user- or document-level metrics.
- If a collaborator has multiple events on the same document, use the collaborator_source from their earliest viewed_at on that document.

Tasks

Task 1: Distribution of how collaborators arrived at documents

Compute the distribution of collaborator acquisition sources. The unit of analysis is a distinct (page_id, user_id) collaborator pair, with the source taken from the collaborator's first view of that document. Return:
- collaborator_source
- collaborator_count = number of distinct collaborator-document pairs
- pct_of_all_collaborations = collaborator_count / total collaborator-document pairs
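
One way to approach this task is sketched below in plain Python. It keeps only non-creator events, reduces them to the earliest event per (page_id, user_id) pair, and then counts pairs by source. The sample rows and the function name `source_distribution` are hypothetical, timestamps are assumed to be ISO 8601 strings, and a NULL source is kept as its own bucket (`None`); a real solution would read `page_events.csv` and might treat missing sources differently.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows standing in for page_events.csv:
# (page_id, user_id, viewed_at, is_creator, collaborator_source)
events = [
    ("p1", "u1", "2024-01-01T09:00:00", True, None),
    ("p1", "u2", "2024-01-01T10:00:00", False, "invite_email"),
    ("p1", "u2", "2024-01-02T10:00:00", False, "search"),  # later repeat view, ignored
    ("p1", "u3", "2024-01-01T11:00:00", False, "share_link"),
    ("p2", "u2", "2024-01-03T08:00:00", True, None),
    ("p2", "u1", "2024-01-03T09:00:00", False, "invite_email"),
    ("p2", "u3", "2024-01-03T12:00:00", False, "search"),
]


def source_distribution(events):
    """Return (collaborator_source, collaborator_count, pct_of_all_collaborations) rows."""
    first_touch = {}  # (page_id, user_id) -> (earliest viewed_at, source at that view)
    for page_id, user_id, viewed_at, is_creator, source in events:
        if is_creator:
            continue  # creators are not collaborators on their own document
        ts = datetime.fromisoformat(viewed_at)
        key = (page_id, user_id)
        if key not in first_touch or ts < first_touch[key][0]:
            first_touch[key] = (ts, source)

    # One count per distinct collaborator-document pair.
    counts = Counter(source for _, source in first_touch.values())
    total = sum(counts.values())
    return [(source, n, n / total) for source, n in counts.most_common()]


distribution = source_distribution(events)
for row in distribution:
    print(row)
```

Deduplicating before counting matters here: counting raw events instead of pairs would overweight sources whose users revisit documents often.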

Task 2: Who collaborated most with a given user?

Given an input user_id, find the other user who collaborated with that user on the largest number of distinct documents. Two users have collaborated on a document if both appeared on the same page_id at least once. Return:
- input_user_id
- other_user_id
- shared_document_count
If there is a tie, return all tied users.
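
A minimal sketch of this task, assuming the events have already been reduced to (page_id, user_id) tuples: build the set of documents each user appeared on, intersect every other user's set with the input user's set, and keep everyone who reaches the maximum. The sample data and the name `top_collaborators` are hypothetical. Creators are deliberately not filtered out, since the definition above only requires that both users appeared on the page.

```python
from collections import defaultdict

# Hypothetical (page_id, user_id) view events; duplicates are expected.
page_views = [
    ("p1", "u1"), ("p1", "u1"), ("p1", "u2"), ("p1", "u3"),
    ("p2", "u1"), ("p2", "u2"),
    ("p3", "u1"), ("p3", "u3"),
    ("p4", "u2"), ("p4", "u3"),
    ("p5", "u2"), ("p5", "u3"),
]


def top_collaborators(page_user_pairs, input_user_id):
    """Return (input_user_id, other_user_id, shared_document_count) for all tied top users."""
    pages_by_user = defaultdict(set)
    for page_id, user_id in page_user_pairs:
        pages_by_user[user_id].add(page_id)  # sets collapse repeat views

    input_pages = pages_by_user.get(input_user_id, set())
    shared = {}
    for other, pages in pages_by_user.items():
        if other == input_user_id:
            continue
        overlap = len(pages & input_pages)
        if overlap > 0:
            shared[other] = overlap

    if not shared:
        return []
    best = max(shared.values())
    # Keep every user at the maximum so ties are all returned.
    return sorted((input_user_id, other, n) for other, n in shared.items() if n == best)


print(top_collaborators(page_views, "u1"))
print(top_collaborators(page_views, "u2"))
```

In SQL the same idea is usually a self-join of the deduplicated (page_id, user_id) table on page_id, followed by a count of distinct pages and a rank that keeps ties.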

Task 3: Which country has the highest collaboration rate?

Join to users.csv and compute the collaboration rate by country. For this exercise, define:
- numerator = number of distinct users in that country who were a non-creator collaborator on at least one document
- denominator = number of distinct users in that country who appeared in page_events at least once
Return:
- country
- collaborating_users (the numerator)
- active_users (the denominator)
- collaboration_rate = collaborating_users / active_users
Identify the country or countries with the highest collaboration rate.
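
The country-level rate can be sketched as follows, again in plain Python. Each country gets two sets of user IDs: everyone who appeared in the events (the denominator) and everyone who appeared at least once as a non-creator (the numerator). The sample data and function names are hypothetical, and the sketch assumes inner-join behavior: events whose user_id is missing from `users.csv` are dropped, and users with no events never enter the denominator.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two files.
# Events: (page_id, user_id, is_creator)
events = [
    ("p1", "u1", True),
    ("p1", "u2", False),
    ("p1", "u3", False),
    ("p2", "u3", True),
    ("p2", "u4", False),
    ("p2", "u2", False),
]
# users.csv as a user_id -> country lookup; u5 never appears in events.
country_by_user = {"u1": "US", "u2": "US", "u3": "DE", "u4": "DE", "u5": "FR"}


def collaboration_rate_by_country(events, country_by_user):
    """Return (country, collaborating_users, active_users, collaboration_rate) rows."""
    active = defaultdict(set)
    collaborating = defaultdict(set)
    for _page_id, user_id, is_creator in events:
        country = country_by_user.get(user_id)
        if country is None:
            continue  # assumption: skip users absent from users.csv
        active[country].add(user_id)
        if not is_creator:
            collaborating[country].add(user_id)

    rows = []
    for country in sorted(active):
        n_collab = len(collaborating[country])
        n_active = len(active[country])
        rows.append((country, n_collab, n_active, n_collab / n_active))
    return rows


def top_countries(rows):
    """Return every country tied for the highest collaboration_rate."""
    if not rows:
        return []
    best = max(rate for _, _, _, rate in rows)
    return [country for country, _, _, rate in rows if rate == best]


rates = collaboration_rate_by_country(events, country_by_user)
print(rates)
print(top_countries(rates))
```

Because both counts are over distinct users, a user who created one document and collaborated on another still counts once in each set. In practice it is also worth reporting active_users alongside the rate, since a country with very few active users can top the ranking by chance.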
