This question evaluates a data scientist's competency in cohort-based retention analysis, time-based data manipulation, and aggregation using Data Manipulation (SQL/Python) techniques.
You are given two tables:
users(user_id BIGINT, signup_ts TIMESTAMP)
— one row per user.
user_activity(user_id BIGINT, activity_ts TIMESTAMP, event_name VARCHAR)
— zero or more activity rows per user.
Assume all timestamps are in UTC. Define a user's cohort as the calendar month of signup_ts. A user is retained in month k if they have at least one row in user_activity during the calendar month that is k months after their cohort month. Month 0 is the signup month.
Write a SQL query to compute cohort-based monthly retention for k = 0, 1, 2, 3, 4, 5, 6.
Return these columns:
cohort_month
— first day of the signup month
months_since_signup
— integer from 0 to 6
cohort_size
— number of users in the cohort
retained_users
— number of distinct users active in that relative month
retention_rate
—
retained_users / cohort_size
as a decimal
Make sure each user is counted at most once per cohort-month and relative month.