Calculate Pirated Usage and Revenue Loss
Company: Shopify
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: hard
Interview Round: Technical Screen
You are analyzing theme piracy on an e-commerce platform.
Assume the analysis window is **2023-01-01 through 2023-12-31**, all timestamps are stored in **UTC**, and all monthly buckets should be computed by **calendar month in UTC**.
### Tables
1. **shops**
- `shop_id BIGINT`
- `created_at TIMESTAMP`
2. **themes**
- `theme_id BIGINT`
- `theme_name STRING`
- `is_pirated BOOLEAN`
3. **pirated_theme_map**
- `pirated_theme_id BIGINT`
- `legitimate_theme_id BIGINT`
- Each pirated theme maps to the legitimate theme it copies.
4. **theme_pricing**
- `theme_id BIGINT`
- `monthly_license_fee_usd NUMERIC(10,2)`
5. **theme_installations**
- `install_id BIGINT`
- `shop_id BIGINT`
- `theme_id BIGINT`
- `installed_at TIMESTAMP`
- `valid_from DATE`
- `valid_to DATE NULL`
- Each row represents one continuous period during which a theme is active on a shop.
6. **calendar_months**
- `month_start DATE`
- One row per month in the analysis window.
### Key relationships
- `theme_installations.shop_id -> shops.shop_id`
- `theme_installations.theme_id -> themes.theme_id`
- `pirated_theme_map.pirated_theme_id -> themes.theme_id`
- `pirated_theme_map.legitimate_theme_id -> theme_pricing.theme_id`
### Definitions
- A **pirated installation** is any row in `theme_installations` whose `theme_id` appears in `pirated_theme_map.pirated_theme_id`.
- For monthly activity logic, treat an installation as active in a month if its active interval overlaps that month.
- If `valid_to` is `NULL`, treat the installation as active through **2023-12-31**.
- Estimated **revenue loss** for a pirated installation equals the `monthly_license_fee_usd` of the mapped legitimate theme for every month in which the pirated installation is active for at least one day.
### Tasks
1. Compute the monthly **pirated install rate**, defined as:
- numerator = distinct `shop_id` values that installed at least one pirated theme during that month
- denominator = distinct `shop_id` values that installed any theme during that month
Required output columns:
- `month_start`
- `pirated_install_shops`
- `all_install_shops`
- `pirated_install_rate`
2. Compute the monthly **active pirated usage rate**, defined as:
- numerator = distinct `shop_id` values with at least one pirated theme active at any point during that month
- denominator = distinct `shop_id` values with at least one theme active at any point during that month
Required output columns:
- `month_start`
- `active_pirated_shops`
- `active_theme_shops`
- `active_pirated_rate`
3. Compute **revenue loss over time** by month. Because the business asks for the loss "over time," return both:
- monthly estimated loss in that month
- cumulative estimated loss from the start of the analysis window up to that month
Required output columns:
- `month_start`
- `monthly_revenue_loss_usd`
- `cumulative_revenue_loss_usd`
Write SQL only.
Quick Answer: This question evaluates competency in temporal data manipulation, SQL joins and aggregations, distinct counting, and revenue-impact estimation within a Data Manipulation (SQL/Python) context for a data scientist role.