How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

What difficulty level is this interview question?

This is a Medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Onsite rounds at Netflix.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Netflix during technical interviews.

Aggregate D1 retention cohorts in SQL | Netflix Interview Question

Aggregate D1 retention cohorts in SQL

Company: Netflix

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Onsite

Today is 2025-09-01. Using SQL (optionally outline a pandas approach too), compute daily engagement and D1 retention for the last 7 days (2025-08-26 through 2025-09-01). Exclude any user_id beginning with 'bot_'. Use the following schema and sample data. Schema: users(user_id STRING, signup_date DATE, country STRING) events(user_id STRING, event_time TIMESTAMP, event_type STRING, product_id STRING) purchases(user_id STRING, purchase_time TIMESTAMP, order_id STRING, amount DECIMAL(10,2)) Sample tables (UTC times): users | user_id | signup_date | country | | u1 | 2025-08-25 | US | | u2 | 2025-08-26 | US | | u3 | 2025-08-26 | IN | | u4 | 2025-08-31 | US | | u5 | 2025-09-01 | US | | bot_1 | 2025-08-26 | US | events | user_id | event_time | event_type | product_id | | u1 | 2025-08-26 09:00:00 | app_open | NULL | | u1 | 2025-08-27 10:00:00 | app_open | NULL | | u2 | 2025-08-26 12:00:00 | app_open | NULL | | u2 | 2025-08-27 13:00:00 | view | P1 | | u2 | 2025-08-28 12:00:00 | app_open | NULL | | u3 | 2025-08-26 15:00:00 | app_open | NULL | | u3 | 2025-08-27 16:00:00 | app_open | NULL | | u4 | 2025-09-01 08:00:00 | app_open | NULL | | u5 | 2025-09-01 09:00:00 | app_open | NULL | | bot_1 | 2025-08-26 10:00:00 | app_open | NULL | purchases | user_id | purchase_time | order_id | amount | | u1 | 2025-08-27 11:00:00 | o1 | 9.99 | | u2 | 2025-08-28 11:30:00 | o2 | 4.00 | | u3 | 2025-08-29 14:00:00 | o3 | 2.50 | | u4 | 2025-09-01 09:30:00 | o4 | 1.00 | | u5 | 2025-09-02 10:00:00 | o5 | 5.00 | | bot_1 | 2025-08-26 10:30:00 | o6 | 100.00 | Task: Produce a single SQL query returning one row per day d with the following columns: (1) day (DATE), (2) dau: distinct users with any event on day d, (3) new_users: users whose first-ever event_time is on day d, (4) d1_retention_rate: among new_users on day d, fraction with any event on day d+1, (5) revenue_per_dau: total purchase amount on day d divided by dau. Notes: treat days in UTC; deduplicate exact-duplicate events via SELECT DISTINCT user_id, event_time, event_type, product_id before aggregating; exclude 'bot_%' users from all metrics. Provide the SQL and, briefly, how you would compute the same using pandas (high-level steps).

Quick Answer: This question evaluates a candidate's competency in time-series cohort analysis and SQL-based data manipulation, specifically measuring DAU, new-user cohorts, D1 retention and revenue-per-user while handling deduplication and user filtering.

Schema: users(user_id STRING, signup_date DATE, country STRING) events(user_id STRING, event_time TIMESTAMP, event_type STRING, product_id STRING) purchases(user_id STRING, purchase_time TIMESTAMP, order_id STRING, amount DECIMAL(10,2))

Sample tables (UTC times): users | user_id | signup_date | country | | u1 | 2025-08-25 | US | | u2 | 2025-08-26 | US | | u3 | 2025-08-26 | IN | | u4 | 2025-08-31 | US | | u5 | 2025-09-01 | US | | bot_1 | 2025-08-26 | US |

events | user_id | event_time | event_type | product_id | | u1 | 2025-08-26 09:00:00 | app_open | NULL | | u1 | 2025-08-27 10:00:00 | app_open | NULL | | u2 | 2025-08-26 12:00:00 | app_open | NULL | | u2 | 2025-08-27 13:00:00 | view | P1 | | u2 | 2025-08-28 12:00:00 | app_open | NULL | | u3 | 2025-08-26 15:00:00 | app_open | NULL | | u3 | 2025-08-27 16:00:00 | app_open | NULL | | u4 | 2025-09-01 08:00:00 | app_open | NULL | | u5 | 2025-09-01 09:00:00 | app_open | NULL | | bot_1 | 2025-08-26 10:00:00 | app_open | NULL |

purchases | user_id | purchase_time | order_id | amount | | u1 | 2025-08-27 11:00:00 | o1 | 9.99 | | u2 | 2025-08-28 11:30:00 | o2 | 4.00 | | u3 | 2025-08-29 14:00:00 | o3 | 2.50 | | u4 | 2025-09-01 09:30:00 | o4 | 1.00 | | u5 | 2025-09-02 10:00:00 | o5 | 5.00 | | bot_1 | 2025-08-26 10:30:00 | o6 | 100.00 |

Task: Produce a single SQL query returning one row per day d with the following columns: (1) day (DATE), (2) dau: distinct users with any event on day d, (3) new_users: users whose first-ever event_time is on day d, (4) d1_retention_rate: among new_users on day d, fraction with any event on day d+1, (5) revenue_per_dau: total purchase amount on day d divided by dau. Notes: treat days in UTC; deduplicate exact-duplicate events via SELECT DISTINCT user_id, event_time, event_type, product_id before aggregating; exclude 'bot_%' users from all metrics. Provide the SQL and, briefly, how you would compute the same using pandas (high-level steps).

Aggregate D1 retention cohorts in SQL

Company: Netflix

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Onsite

Aggregate D1 retention cohorts in SQL

Quick Overview

Aggregate D1 retention cohorts in SQL

Write your answer

Aggregate D1 retention cohorts in SQL

Quick Overview

Aggregate D1 retention cohorts in SQL

Write your answer