This question evaluates a data scientist's skills in data manipulation, join operations, deduplication, aggregation, timestamp handling, and metric definition within the Data Manipulation (SQL/Python) domain.
You are given two CSV files and asked to analyze document collaboration behavior in Python, SQL, or pandas.
page_events.csvEach row is a page-view event.
page_id
STRING: unique document/page identifier
user_id
STRING: user who viewed the document
viewed_at
TIMESTAMP: event time of the view
is_creator
BOOLEAN:
true
if this user is the creator/owner of the document,
false
otherwise
collaborator_source
STRING: for non-creators, the channel through which the user first reached the document (for example:
invite_email
,
share_link
,
search
,
comment_mention
); may be
NULL
for creators
users.csvuser_id
STRING: unique user identifier
country
STRING: user's country
users.user_id
joins to
page_events.user_id
.
page_id
with
is_creator = false
at least once.
(page_id, user_id)
collaboration when computing user- or document-level metrics.
collaborator_source
from their earliest
viewed_at
on that document.
(page_id, user_id)
collaborator pair, using the collaborator's first view of that document.
Return:
collaborator_source
collaborator_count
= number of distinct collaborator-document pairs
pct_of_all_collaborations
=
collaborator_count / total collaborator-document pairs
user_id
, find the other user who collaborated with that user on the largest number of distinct documents.
Define two users as having collaborated on a document if both appeared on the same
page_id
at least once.
Return:
input_user_id
other_user_id
shared_document_count
users.csv
and compute collaboration rate by country.
For this exercise, define:
page_events
at least once
country
collaborating_users
active_users
collaboration_rate