Count Article Types Viewed
You are given article view events and article metadata.
Table 1: article_views — one row per article view event.
| Column | Type | Description |
|---|
user_id | INT | The user who viewed the article |
article_id | INT | The article that was viewed |
view_date | DATE | UTC calendar date of the view |
Table 2: articles — maps each article to its type.
| Column | Type | Description |
|---|
article_id | INT | Article identifier |
article_type | VARCHAR | The category/type of the article |
Relationship: article_views.article_id = articles.article_id
Assumptions
-
Each row in
article_views
represents one article view event.
-
view_date
is a calendar date; treat dates as UTC and assume no timezone conversion is needed.
-
If a user views multiple articles of the same
article_type
, that type is counted
only once
for that user (distinct types).
Write SQL for the following:
-
Daily distinct types per user.
For
2019-01-01
, return the number of distinct article types viewed by each user.
-
Output columns:
user_id, num_article_types
-
Histogram across all dates.
Build a histogram of the number of distinct article types viewed per user across the full dataset (all available dates). For each possible value of
num_article_types
, return how many users viewed exactly that many distinct article types.
-
Output columns:
num_article_types, num_users
Constraints & Assumptions
-
Preserve the scope, facts, inputs, and requested outputs from the prompt above.
-
If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
-
Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.
Clarifying Questions to Ask
-
Clarify SQL dialect or Python library versions, date/time semantics, duplicate handling, and null handling.
-
Define the grain of each intermediate result before aggregating.
-
State expected output columns and ordering explicitly.
What a Strong Answer Covers
-
A query or pandas plan that matches the requested output grain.
-
Correct joins, filters, grouping, window functions, and treatment of NULLs or duplicates.
-
A brief explanation of why the result is correct and how it handles edge cases.
-
Performance notes, indexes/partitioning, and validation queries when relevant.
Follow-up Questions
-
How would you test the query on a tiny hand-built dataset?
-
What changes if duplicate events or late-arriving data are present?
-
Which indexes, clustering, or partitions would help at production scale?