Build cohort 30-day retention from signup date

Q: Build cohort 30-day retention from signup date

This is a Analytics & Experimentation interview question from Intuit for Data Scientist roles. View the full question and solution on PracHub.

Q: How do I approach Analytics & Experimentation interview questions?

Analytics & Experimentation questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master analytics & experimentation interviews.

Question

Task: Monthly Cohorted Retention (30-day, then 30/60/90-day curve)

You are given a single table, company, with columns:

company_id (INT)
signup_date (DATE)
subscription_date (DATE, nullable)
termination_date (DATE, nullable)

Compute monthly cohorted retention based on signup_date. Define cohort_month = DATE_TRUNC('month', signup_date).

Assume data_cutoff_date is a parameter representing the latest reliable observation date. If not provided, default it to the maximum observed date across signup_date, termination_date, subscription_date:

data_cutoff_date = GREATEST(MAX(signup_date), MAX(termination_date), MAX(subscription_date))

Right-censoring rule: include only cohorts for which the retention window is fully observable at the cohort level. Specifically, exclude any cohort where cohort_month + INTERVAL '30 days' > data_cutoff_date.

A company is considered retained at day D if:

termination_date IS NULL, OR
termination_date > signup_date + INTERVAL 'D days'

Edge cases to handle:

termination_date exactly equals signup_date + 30 days is NOT retained.
NULL termination_date implies retained at any threshold (subject to censoring rule).
Companies without subscription_date are still included in cohorts and evaluated for retention.
Present rates as retained_xxd / cohort_size with safe division and 0.0 when cohort_size = 0.

Return for the 30-day retention:

cohort_month (DATE)
cohort_size (INT)
retained_30d (INT)
retention_30d_rate (DECIMAL, 4 decimals)

Then generalize to a monthly retention curve by adding 60 and 90 days using a generate_series of thresholds (30, 60, 90) and pivoting/widening the results into:

retained_30d, retained_60d, retained_90d (INT)
retention_30d_rate, retention_60d_rate, retention_90d_rate (DECIMAL, 4 decimals)

Table (sample):

company +------------+-------------+-------------------+------------------+ | company_id | signup_date | subscription_date | termination_date | +------------+-------------+-------------------+------------------+ | 1 | 2019-05-20 | 2019-06-02 | 2020-01-10 | | 2 | 2019-06-15 | 2019-06-20 | NULL | | 3 | 2019-07-01 | 2019-07-05 | 2019-12-31 | | 4 | 2020-06-10 | 2020-06-11 | NULL | | 5 | 2020-06-30 | 2020-07-02 | NULL | | 6 | 2020-07-15 | NULL | NULL | +------------+-------------+-------------------+------------------+

Build cohort 30-day retention from signup date

Task: Monthly Cohorted Retention (30-day, then 30/60/90-day curve)

Solution (Locked)

Comments (0)