This question evaluates a candidate's competency in SQL data manipulation, covering ranking/top-N queries, aggregations and percentage calculations, temporal state reconstruction from action logs, and detection of first-occurrence events.
You are given several independent datasets. For each dataset, write the required SQL query (ANSI SQL is fine). Assume dates are in UTC and DATE columns are calendar dates (no time component) unless otherwise specified.
world_population| column | type | notes |
|---|---|---|
| continent | STRING | e.g., North America |
| country | STRING | country name |
| population | BIGINT | population for that country |
Task A1. For each continent, return the single country with the largest population.
Output: continent, country, population
Task A1 follow-up. For each continent, return the top 2 countries by population.
Output: continent, country, population
Task A2. For each continent, compute the population share of each of the top 2 countries, where:
total_population
= sum of
population
within the continent
population_share_pct
=
100 * population / total_population
Output: continent, country, population, total_population, population_share_pct
You have a baseline snapshot as of 2020-01-31 and an action log for February.
status"Contains all members’ latest push notification setting status as of 2020-01-31."
| column | type | notes |
|---|---|---|
| member_id | BIGINT | primary key |
| status | STRING | one of on, off |
actions"All actions members made in Feb 2020 (after 2020-01-31). Assume at most one action per member per day."
| column | type | notes |
|---|---|---|
| member_id | BIGINT | may include new members not in status |
| date_sk | DATE | action date in Feb 2020 |
| action | STRING | one of turn_on, turn_off |
Task B1. Return each member’s current notification status as of 2020-02-29.
Requirements:
actions
, their latest action by
date_sk
determines the current status.
status
from the snapshot.
actions
table may include
new members
not present in
status
; they must still appear in the output.
Output: member_id, current_status
Task B1 follow-up. If the actions table did not contain date_sk (only action events without timestamps, or only aggregated action counts), explain what additional assumptions/data you would need, and under a reasonable assumption, compute the final status.
video_posts| column | type | notes |
|---|---|---|
| post_date | DATE | date the video was posted |
| memberid | BIGINT | member who posted |
| video_length | INT | length in seconds |
members| column | type | notes |
|---|---|---|
| memberid | BIGINT | primary key |
| country | STRING | e.g., usa, uk |
| join_date | DATE | date the member joined |
Task C1. How many members posted their first-ever video on the same day they joined?
Output: single row, members_first_video_same_day
Task C2. For each country, how many distinct members have posted at least one video longer than 60 seconds?
Output: country, num_members
article_views| column | type | notes |
|---|---|---|
| user_id | BIGINT | viewer |
| article_id | BIGINT | viewed article |
| date | DATE | view date |
articles| column | type | notes |
|---|---|---|
| article_id | BIGINT | primary key |
| article_type | STRING | e.g., sports, tech |
Task D1. For date 2019-01-01, compute the number of distinct article types each user viewed.
Output: user_id, num_article_types
Task D2. Build a histogram of num_article_types across users (i.e., how many users viewed 1 type, 2 types, etc.).
Output: num_article_types, num_users