Write SQL for rankings, state, and aggregations
Company: LinkedIn
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: easy
Interview Round: Technical Screen
You are given several independent datasets. For each dataset, write the required SQL query (ANSI SQL is fine). Assume dates are in UTC and `DATE` columns are calendar dates (no time component) unless otherwise specified.
---
## Dataset A — Continent populations
### Table: `world_population`
| column | type | notes |
|---|---|---|
| continent | STRING | e.g., `North America` |
| country | STRING | country name |
| population | BIGINT | population for that country |
**Task A1.** For each `continent`, return the single country with the **largest** `population`.
**Output:** `continent, country, population`
**Task A1 follow-up.** For each `continent`, return the **top 2** countries by `population`.
**Output:** `continent, country, population`
**Task A2.** For each `continent`, compute the **population share** of each of the **top 2** countries, where:
- `total_population` = sum of `population` within the continent
- `population_share_pct` = `100 * population / total_population`
**Output:** `continent, country, population, total_population, population_share_pct`
---
## Dataset B — Push notification status as of a cutoff date
You have a baseline snapshot as of **2020-01-31** and an action log for February.
### Table: `status`
"Contains all members’ latest push notification setting status as of 2020-01-31."
| column | type | notes |
|---|---|---|
| member_id | BIGINT | primary key |
| status | STRING | one of `on`, `off` |
### Table: `actions`
"All actions members made in Feb 2020 (after 2020-01-31). Assume at most one action per member per day."
| column | type | notes |
|---|---|---|
| member_id | BIGINT | may include new members not in `status` |
| date_sk | DATE | action date in Feb 2020 |
| action | STRING | one of `turn_on`, `turn_off` |
**Task B1.** Return each member’s **current notification status as of 2020-02-29**.
Requirements:
- If a member has one or more rows in `actions`, their latest action by `date_sk` determines the current status.
- Members who never acted in Feb retain their `status` from the snapshot.
- The `actions` table may include **new members** not present in `status`; they must still appear in the output.
**Output:** `member_id, current_status`
**Task B1 follow-up.** If the `actions` table did **not** contain `date_sk` (only action events without timestamps, or only aggregated action counts), explain what additional assumptions/data you would need, and under a reasonable assumption, compute the final status.
---
## Dataset C — Video posts and members
### Table: `video_posts`
| column | type | notes |
|---|---|---|
| post_date | DATE | date the video was posted |
| memberid | BIGINT | member who posted |
| video_length | INT | length in seconds |
### Table: `members`
| column | type | notes |
|---|---|---|
| memberid | BIGINT | primary key |
| country | STRING | e.g., `usa`, `uk` |
| join_date | DATE | date the member joined |
**Task C1.** How many members posted their **first-ever** video on the **same day** they joined?
**Output:** single row, `members_first_video_same_day`
**Task C2.** For each `country`, how many **distinct members** have posted at least one video longer than **60 seconds**?
**Output:** `country, num_members`
---
## Dataset D — Article views and types
### Table: `article_views`
| column | type | notes |
|---|---|---|
| user_id | BIGINT | viewer |
| article_id | BIGINT | viewed article |
| date | DATE | view date |
### Table: `articles`
| column | type | notes |
|---|---|---|
| article_id | BIGINT | primary key |
| article_type | STRING | e.g., `sports`, `tech` |
**Task D1.** For date **2019-01-01**, compute the number of **distinct article types** each user viewed.
**Output:** `user_id, num_article_types`
**Task D2.** Build a histogram of `num_article_types` across users (i.e., how many users viewed 1 type, 2 types, etc.).
**Output:** `num_article_types, num_users`
Quick Answer: This question evaluates a candidate's competency in SQL data manipulation, covering ranking/top-N queries, aggregations and percentage calculations, temporal state reconstruction from action logs, and detection of first-occurrence events.