Calculate annual percentages and YoY by cohorts

Q: Calculate annual percentages and YoY by cohorts

This is a Data Manipulation (SQL/Python) interview question from CVS Health for Data Scientist roles. View the full question and solution on PracHub.

Q: How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

Question

Answer both SQL and Python parts. Be precise about deduping and denominator choices.

SQL schema (sample rows): orders order_id | user_id | order_date 1 | 101 | 2023-01-10 2 | 102 | 2023-05-03 3 | 101 | 2024-02-12 4 | 103 | 2024-11-20 5 | 104 | 2024-12-28

order_items order_id | product_id | qty 1 | 10 | 1 1 | 11 | 2 2 | 11 | 1 3 | 12 | 1 4 | 10 | 1 5 | 13 | 1

users user_id | age | location 101 | 27 | NY 102 | 42 | CA 103 | 35 | NY 104 | 23 | TX

A) Three-table percentage (CTE/subquery/case-when allowed): For calendar year 2024, compute the percentage of distinct orders that contained at least one product with category = 'Subscription'. Count each order at most once even if it has multiple subscription items. Output a single row with pct_subscription_2024 rounded to two decimals.

B) YoY change by location and age group: Define age_group buckets as [18–29], [30–44], [45+]. For each (location, age_group) present in users, compute distinct-order counts in 2023 and 2024 and the YoY percent change = (orders_2024 - orders_2023) / NULLIF(orders_2023, 0). Return columns: location, age_group, orders_2023, orders_2024, yoy_pct_change. If orders_2023 = 0, return NULL for yoy_pct_change (avoid divide-by-zero). Assume an order belongs to the age/location of its user at order time. You may use window functions or conditional aggregation.

Python part (use pandas): You are given two DataFrames with the same data as above: df_orders(order_id, user_id, order_date), df_products(product_id, name, category), df_users(user_id, age, location). For year Y = 2024, compute the number of unique users who purchased any product whose name contains the substring 'Pro' (case-insensitive). Return a DataFrame with columns [location, age_group, unique_users] where age_group uses the same bins as in part B, sorted by unique_users descending, then location ascending. You must use merge, str.contains, groupby, and an aggregation (nunique), and ensure stable sorting for ties.

Calculate annual percentages and YoY by cohorts

Comments (0)