Write SQL for content-view analytics
Company: Newyorktimes
Role: Data Analyst
Category: Data Manipulation (SQL/Python)
Difficulty: easy
Interview Round: Technical Screen
## Context
You work with page-view event logs and need to compute several engagement/usage summaries.
Assume a single table:
### `page_views`
| column | type | description |
|---|---|---|
| `view_id` | STRING/INT | unique event id (PK) |
| `agent_id` | STRING/INT | identifier for the viewer ("agent") |
| `content_id` | STRING/INT | identifier of the content item (nullable) |
| `url` | STRING | URL viewed |
| `content_type` | STRING | e.g., `article`, `video`, `podcast`, etc. |
| `device_type` | STRING | one of `phone`, `tab`, `desktop` |
| `viewed_at` | TIMESTAMP | event timestamp |
Assumptions:
- Treat **mobile** as `device_type IN ('phone','tab')`, and **desktop** as `device_type = 'desktop'`.
- Unless otherwise specified, interpret timestamps in **UTC**.
---
## Tasks (write SQL)
### 1) Unique contents viewed by device group
Compute, for each device group (`mobile` vs `desktop`):
- `unique_contents_viewed`: number of distinct `content_id` values viewed (ignore NULLs)
- `total_views`: total number of page-view events
Return columns: `device_group`, `unique_contents_viewed`, `total_views`.
Additionally (in 1–2 sentences), explain whether `unique_contents_viewed` and `total_views` will generally be equal, and why.
---
### 2) Peak usage hours by device type
For each **device type** (`phone`, `tab`, `desktop`), find the **top 3 hours of day** (0–23) with the highest number of views.
Return columns: `device_type`, `hour_of_day`, `views`.
Notes:
- Use an hour-extraction function on `viewed_at`.
- Break ties by choosing the smaller `hour_of_day` first.
---
### 3) Unique content count and NULL content count
Compute:
- `unique_contents_viewed`: number of distinct non-NULL `content_id` values that were viewed
- `null_content_events`: number of events where `content_id` IS NULL
Return columns: `unique_contents_viewed`, `null_content_events`.
---
### 4) Distribution of first-seen content type per agent per day
For each `agent_id` and each calendar date (derived from `viewed_at`), identify the **first** page view of that day (earliest `viewed_at`; if ties, use smallest `view_id`). Take that event’s `content_type`.
Then compute the distribution of these first-seen `content_type` values across all agent-days.
Return columns:
- `event_date`
- `content_type`
- `agent_days` (count of agent-days whose first view had that `content_type`)
- `share` (agent_days divided by total agent-days for that date)
Quick Answer: This question evaluates proficiency in SQL-based data manipulation and analytics, including aggregation, distinct counting, NULL handling, timestamp/hour extraction, device grouping, and row-level ordering, and is situated in the Data Manipulation (SQL/Python) domain.