How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

What difficulty level is this coding question?

This is a easy difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Notion.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Notion during technical interviews.

Analyze document collaboration from CSV logs

Company: Notion

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: easy

Interview Round: Technical Screen

You are given two CSV files exported from a document collaboration product. ## Data ### 1) `doc_access.csv` Each row represents a user accessing (viewing/editing) a document page. | column | type | description | |---|---|---| | `page_id` | STRING/INT | Document/page identifier | | `user_id` | STRING/INT | User who accessed the document | | `event_ts` | TIMESTAMP | When the access happened (assume UTC) | | `is_creator` | BOOLEAN | `TRUE` if this user is the creator/owner of the document; otherwise the user is a collaborator | | `collab_source` | STRING | For collaborators only (`is_creator=FALSE`): how the user became a collaborator (e.g., invited, link share, org default, etc.) | Notes/assumptions: - A “collaborator” is any record with `is_creator = FALSE`. - A user may access the same `page_id` multiple times; unless otherwise specified, treat collaboration at the **document level** (i.e., dedupe by `(page_id, user_id)` where appropriate). ### 2) `users.csv` | column | type | description | |---|---|---| | `user_id` | STRING/INT | User identifier | | `country` | STRING | User’s country | ## Tasks ### Q1) Distribution of collaborator sources Compute the distribution of how collaborators got onto documents. **Output:** - `collab_source` - `num_collaborators` (count of collaborator users; specify whether you count unique users overall or unique `(page_id, user_id)` pairs) - `pct_of_collaborators` ### Q2) For a given user, who did they collaborate with most? Given an input parameter `target_user_id`, find the other user they collaborated with the most. Define “collaborated with” as: the two users both appear on the same `page_id` (after deduping to unique `(page_id, user_id)` memberships). **Output:** - `target_user_id` - `top_collaborator_user_id` - `num_shared_pages` (number of distinct `page_id` where both users appear) Also describe how you would break ties (e.g., smallest user_id, most recent shared collaboration, etc.). ### Q3) Which country collaborates the most? Determine which country’s users are most likely to collaborate with others. Because “collaboration rate” can be defined multiple ways, do the following: 1. Propose a clear numerator and denominator for a per-country collaboration rate. - Example definition (acceptable): - Numerator: # of users in the country who have collaborated with at least one other distinct user on at least one page. - Denominator: total # of users in the country. 2. Compute the collaboration rate by country and return the top country. **Output:** - `country` - `collaboration_rate` - any intermediate counts you used (e.g., `num_users_in_country`, `num_users_who_collaborated`).

Quick Answer: This question evaluates data manipulation and analytical competencies—particularly deduplication, joins, aggregation, co-occurrence counting, and metric definition—applied to CSV access logs and user metadata.

Quick Overview

This question evaluates data manipulation and analytical competencies—particularly deduplication, joins, aggregation, co-occurrence counting, and metric definition—applied to CSV access logs and user metadata.