This question evaluates a candidate's competency in data manipulation and analytics—specifically joins, deduplication of user-document pairs, aggregation, and computation of user- and country-level collaboration metrics using SQL or Python.
You are given two CSV files and asked to analyze collaboration behavior on documents.
document_activity.csvEach row is a document view event.
Schema:
page_id
STRING — unique document ID
user_id
STRING — user who viewed the document
viewed_at
TIMESTAMP — event timestamp in UTC
is_creator
BOOLEAN —
true
if the user is the document creator,
false
otherwise
collaborator_source
STRING NULL — for non-creators, how the user arrived at the document (for example: invite link, email invite, search, direct). This field may be
NULL
for creators.
users.csvSchema:
user_id
STRING — unique user ID
country
STRING — user country
document_activity.user_id = users.user_id
.
(page_id, user_id)
pairs unless a task explicitly asks for event counts.
is_creator = false
).
(page_id, user_id)
.
collaborator_source
collaborator_count
collaborator_share
user_id, find which other users collaborated with that user the most.
page_id
at least once.
target_user_id
co_user_id
shared_page_count
shared_page_count
.
collaborating_users
= number of distinct users in that country who collaborated with at least one other user
total_users
= number of distinct users in that country from
users.csv
collaboration_rate = collaborating_users / total_users
country
collaborating_users
total_users
collaboration_rate