Compute pirated-theme usage and revenue loss
Company: Shopify
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: easy
Interview Round: Technical Screen
You work on a theme marketplace. Some shops install **pirated themes** instead of paying for official themes.
Assume all timestamps are in **UTC**.
## Tables
### `shops`
- `shop_id` BIGINT PRIMARY KEY
- `created_at` TIMESTAMP
### `theme_installs`
Represents a shop having a theme active for a time range.
- `shop_id` BIGINT (FK → `shops.shop_id`)
- `theme_id` BIGINT
- `valid_from` TIMESTAMP
- `valid_to` TIMESTAMP NULL — NULL means the install is still active
### `pirated_themes`
List of themes known to be pirated.
- `theme_id` BIGINT PRIMARY KEY
### `theme_prices`
Monthly price for an official theme.
- `theme_id` BIGINT PRIMARY KEY
- `price_usd_per_month` NUMERIC(10,2)
### `calendar_months`
A helper table with one row per month.
- `month_start` DATE PRIMARY KEY — e.g., 2024-01-01
## Q1 — Is pirated usage increasing?
For each calendar month, compute the percent of shops that have **at least one active install of a pirated theme at any time during that month**.
- Define “active during a month” as the install interval overlapping the month:
`valid_from < next_month_start AND coalesce(valid_to, <as_of_timestamp>) >= month_start`
- Use `COALESCE(valid_to, <as_of_timestamp>)` to handle NULL `valid_to`.
### Output
- `month_start`
- `pirated_shops` (count of distinct `shop_id` with pirated theme active during the month)
- `total_shops` (count of distinct `shop_id` with any theme active during the month)
- `pirated_shop_pct` (`pirated_shops / total_shops`)
## Q2 — Estimate revenue loss “over time” (cumulative)
Assume each pirated theme would have generated revenue equal to its official theme’s monthly price in `theme_prices` while it is active.
Compute:
1) `monthly_revenue_loss_usd`: for each month, sum `price_usd_per_month` across distinct `(shop_id, theme_id)` that are pirated and active during the month.
2) `cumulative_revenue_loss_usd`: the rolling cumulative sum up through each month (i.e., “as of January”, “as of February”, …).
### Output
- `month_start`
- `monthly_revenue_loss_usd`
- `cumulative_revenue_loss_usd`
Notes:
- You may use `calendar_months` and a join to expand installs into months.
- Clearly handle NULL `valid_to` using `COALESCE`.
Quick Answer: This question evaluates SQL data manipulation and time-series analytics skills, focusing on interval overlap logic, aggregation of distinct entities, and cumulative revenue calculations.