Write Queries for Pinterest Engagement Tasks
Company: Pinterest
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: medium
Interview Round: Technical Screen
You are given several data-manipulation tasks based on Pinterest-style product data. Use UTC for all timestamp-to-date conversions unless stated otherwise. Treat time windows as inclusive at the start and exclusive at the end.
Task 1: SQL CTR by pin format for new US users
Tables:
`events`
- `event_id BIGINT`: unique event identifier.
- `user_id BIGINT`: user who took the action.
- `pin_id BIGINT`: pin involved in the action.
- `event_timestamp TIMESTAMP`: time of the action in UTC.
- `event_type VARCHAR`: either `impression` or `click`.
`users_d`
- `user_id BIGINT`: unique user identifier.
- `signup_timestamp TIMESTAMP`: user signup time in UTC.
- `country_code VARCHAR`: ISO country code, for example `US`.
`pins_classification`
- `pin_id BIGINT`: unique pin identifier.
- `pin_format VARCHAR`: pin format, for example `image`, `video`, or `idea`.
- `category_id BIGINT`: optional content category identifier.
Relationships:
- `events.user_id = users_d.user_id`.
- `events.pin_id = pins_classification.pin_id`.
Definition:
- Click-through rate, or CTR, is `number of clicks / number of impressions`.
- A user is considered new for an event if the user signed up before or at the event timestamp and less than 30 days before the event timestamp.
- Only include users with `country_code = US`.
Write a SQL query that returns one row per `event_date` and `pin_format` with the following columns:
- `event_date DATE`
- `pin_format VARCHAR`
- `impressions BIGINT`
- `clicks BIGINT`
- `ctr DOUBLE`
Task 2: SQL percentage of users who saw fresh content
Tables:
`pins`
- `pin_id BIGINT`: unique pin identifier.
- `created_timestamp TIMESTAMP`: pin creation time in UTC.
`pin_impression_events`
- `user_id BIGINT`: user who saw the pin.
- `pin_id BIGINT`: pin that was impressed.
- `impression_timestamp TIMESTAMP`: time of the impression in UTC.
- `impression_type VARCHAR`: impression surface or type, such as `home_feed`, `search`, or `related_pins`.
- `impression_count BIGINT`: number of impressions represented by the row.
Definitions:
- A pin is fresh if it receives at least 2 total impressions within 7 days of `created_timestamp`, using the window `[created_timestamp, created_timestamp + interval 7 days)`.
- A user saw fresh content if the user had at least one impression of a fresh pin during the reporting window.
- The denominator is distinct users with at least one impression during the reporting window.
- Use parameters `:start_timestamp` and `:end_timestamp` for the reporting window.
Write a SQL query that returns:
- `total_users BIGINT`
- `users_who_saw_fresh_content BIGINT`
- `pct_users_who_saw_fresh_content DOUBLE`
Task 3: Python similar pins by board co-occurrence
You are given a dictionary `boards` mapping each board id to the unique pins saved on that board. For a given pin, compute the similarity score between the given pin and every other pin. The similarity score is the number of boards on which the other pin appears together with the given pin.
Example input:
`boards = {b1: [a, c, d, e], b2: [a, c, e, f], b3: [a, c, g, i]}` and `given_pin = a`.
Expected scores:
- `c = 3`
- `e = 2`
- `d = 1`
- `f = 1`
- `g = 1`
- `i = 1`
Do not output pins with similarity score 0. Return pins with their scores, sorted by descending score and then by ascending pin id. If an optional argument `n` is provided, return only the top `n` pins.
Task 4: Python or pandas category with highest average time spent for video pins
You are given a pandas DataFrame `engagement` with columns:
- `user_id`
- `pin_id`
- `pin_type`: for example `image` or `video`.
- `category_id`
- `time_spent_seconds`
You are also given a dictionary `category_mapping` from `category_id` to a human-readable category name. For rows whose `category_id` is missing from the dictionary or null, map the category to `unknown`.
Find the category with the highest average `time_spent_seconds` among video pins only. Return the category name, average time spent, and number of video-pin rows used for that category. Specify how you would handle ties.
Quick Answer: This question evaluates SQL-based data manipulation and Python set/co-occurrence computation skills, covering aggregation, joins, timestamp-based cohorting and windowing, freshness detection, CTR calculation, and similarity scoring within a data scientist context.