This question evaluates a candidate's ability to perform relational joins, deduplicate by categorical attributes, apply date-based filtering, and compute grouped aggregations and distributional histograms, within the domain of Data Manipulation (SQL/Python) for a Data Scientist role.
You are given article view events and article metadata.
article_views(user_id INT, article_id INT, view_date DATE)
articles(article_id INT, article_type VARCHAR)
Join key: article_views.article_id = articles.article_id.
Assume view_date is a calendar date and timezone handling is not needed.
Tasks:
2019-01-01
, return the number of distinct article types viewed by each user.
num_article_types
, return how many users viewed exactly that many distinct article types.
Definitions:
article_type
, count that type only once for that user.
Required output:
user_id, num_article_types
num_article_types, num_users