Write SQL for content view metrics by device
Company: Newyorktimes
Role: Data Analyst
Category: Data Manipulation (SQL/Python)
Difficulty: medium
Interview Round: Technical Screen
You are a reporting analyst for a news subscription product. You have a fact table of page/content view events.
Assume the following table in Google BigQuery (Standard SQL):
`page_views` (one row per view event)
- `event_id` STRING (unique)
- `agent_id` STRING (an identifier for the reader/user agent)
- `device_type` STRING (one of: 'phone', 'tab', 'desktop')
- `content_id` STRING (nullable; may be NULL if not resolved)
- `url` STRING
- `content_type` STRING (e.g., 'article', 'video', 'interactive', etc.; may be NULL)
- `event_ts` TIMESTAMP (UTC)
Definitions:
- “Mobile” = device_type IN ('phone','tab')
- “Desktop” = device_type = 'desktop'
- A “unique content view” means counting DISTINCT `content_id` values (NULLs should not be counted as a distinct content).
Write SQL queries to answer each of the following:
1) Mobile vs Desktop unique contents viewed:
- Return two rows (mobile, desktop) with the number of unique contents viewed.
- Also include total view events for each device group so it’s clear how “# of unique contents viewed” differs from “# of views”.
- Output columns: `device_group`, `unique_contents_viewed`, `total_views`.
2) Peak usage hours by device group:
- For each device group (mobile, desktop), find the top 3 hours of day (0–23) with the most view events.
- Use `EXTRACT(HOUR FROM event_ts)`.
- Output columns: `device_group`, `hour_of_day`, `views`, `hour_rank`.
3) Content ID coverage:
- Compute (a) how many unique non-NULL contents were viewed and (b) how many view events have `content_id` IS NULL.
- Output columns: `unique_nonnull_contents_viewed`, `null_content_view_events`.
4) Distribution of first content type per agent per day:
- For each `agent_id` and calendar date (UTC), find that agent’s first view event of the day (earliest `event_ts`).
- Take the `content_type` of that first event and compute the overall distribution across all agent-days.
- Output columns: `content_type`, `agent_days`, `share_of_agent_days`.
Notes:
- If there are ties for the first event timestamp within an agent-day, break ties deterministically using `event_id` ascending.
- You may assume a large dataset; write SQL that is idiomatic for BigQuery.
Quick Answer: Evaluates SQL data-manipulation and analytics concepts—aggregation with DISTINCT, NULL handling, grouping, time extraction, deterministic tie-breaking, and window functions for ranking—within the Data Manipulation (SQL/Python) category at an intermediate-to-advanced level.