This question evaluates a data scientist's competency in cohort-based GMV calculation, payer-rate measurement, and handling common edge cases across events, orders, payments, and refunds.

You are given the following schema (timestamps are UTC):
users(user_id INT, country STRING, created_at TIMESTAMP) events(user_id INT, event_ts TIMESTAMP, event_type STRING) -- only app_open rows are relevant orders(order_id INT, user_id INT, order_ts TIMESTAMP, status STRING) -- status in ('placed','canceled') payments(payment_id INT, order_id INT, amount DECIMAL(10,2), payment_ts TIMESTAMP, is_valid BOOL) refunds(refund_id INT, order_id INT, amount DECIMAL(10,2), refund_ts TIMESTAMP)
Small sample data:
users +---------+---------+---------------------+ | user_id | country | created_at | +---------+---------+---------------------+ | 1 | US | 2025-06-20 00:00:00 | | 2 | US | 2025-07-10 00:00:00 | | 3 | IN | 2025-08-02 00:00:00 | | 4 | US | 2025-08-25 00:00:00 | | 5 | US | 2025-07-28 00:00:00 | | 6 | BR | 2025-08-15 00:00:00 | +---------+---------+---------------------+
events (only app_open shown) +---------+---------------------+-----------+ | user_id | event_ts | event_type| +---------+---------------------+-----------+ | 1 | 2025-08-05 10:00:00 | app_open | | 2 | 2025-08-10 09:00:00 | app_open | | 2 | 2025-08-28 13:00:00 | app_open | | 3 | 2025-08-03 08:30:00 | app_open | | 4 | 2025-08-30 20:10:00 | app_open | | 5 | 2025-08-12 12:00:00 | app_open | | 6 | 2025-08-31 23:55:00 | app_open | +---------+---------------------+-----------+
orders +----------+---------+---------------------+----------+ | order_id | user_id | order_ts | status | +----------+---------+---------------------+----------+ | 101 | 1 | 2025-08-05 10:05:00 | placed | | 102 | 2 | 2025-08-10 09:05:00 | placed | | 103 | 2 | 2025-08-28 13:05:00 | canceled | | 104 | 3 | 2025-08-03 08:35:00 | placed | | 105 | 4 | 2025-08-30 20:15:00 | placed | | 106 | 5 | 2025-08-12 12:05:00 | placed | | 107 | 6 | 2025-08-31 23:58:00 | placed | +----------+---------+---------------------+----------+
payments +------------+----------+--------+---------------------+----------+ | payment_id | order_id | amount | payment_ts | is_valid | +------------+----------+--------+---------------------+----------+ | 201 | 101 | 50.00 | 2025-08-05 10:06:00 | 1 | | 202 | 102 | 20.00 | 2025-08-10 09:06:00 | 1 | | 203 | 103 | 15.00 | 2025-08-28 13:06:00 | 1 | | 204 | 103 | 15.00 | 2025-08-28 13:06:00 | 0 | -- duplicate/invalid | 205 | 104 | 30.00 | 2025-09-01 00:01:00 | 1 | -- late payment (Sep) | 206 | 105 | 100.00 | 2025-08-30 20:16:00 | 1 | | 207 | 106 | 40.00 | 2025-08-12 12:06:00 | 1 | | 208 | 107 | 25.00 | 2025-09-01 00:10:00 | 1 | -- late payment (Sep) +------------+----------+--------+---------------------+----------+
refunds +-----------+----------+--------+---------------------+ | refund_id | order_id | amount | refund_ts | +-----------+----------+--------+---------------------+ | 301 | 103 | 15.00 | 2025-08-29 10:00:00 | | 302 | 105 | 100.00 | 2025-09-02 09:00:00 | -- refund after month-end | 303 | 106 | 10.00 | 2025-08-20 12:00:00 | +-----------+----------+--------+---------------------+
Task (one Standard SQL query): For calendar month 2025-08, output one row per signup cohort month (cohort_month = DATE_TRUNC(created_at, MONTH)) with: cohort_month, active_users_aug, payers_aug, payer_rate_aug, gmv_aug_usd. Rules and edge cases to handle precisely: