##### Question
You are given a transaction-level table and must compute a weekly time series of how many customers have reached a year-to-date spending threshold. The problem can be solved in **SQL** or **Python (pandas)**.
## Table
**transactions**
- `date` (DATE) — transaction date (assume UTC, calendar dates)
- `transaction_id` (STRING/INT) — primary key
- `customer_id` (STRING/INT)
- `dollars` (NUMERIC) — non-negative spend amount in USD
## Definitions
- **Week**: a calendar week starting on Monday. Report each week by its `week_start_date` (the Monday of the week). You may pick a different, consistent week boundary if you state it.
- **Year-to-date (YTD) spend as of a week**: for a given `customer_id` in calendar year `Y`, the sum of `dollars` from **Jan 1 of year Y** up to and including the **last day of that week**.
## Task
For each week present in the data, compute the **number of distinct customers** whose **YTD spend as of that week** is **at least $1000**.
1. **Core metric.** For every week, count the distinct customers whose cumulative spend from the start of that calendar year through the end of the week is ≥ $1000.
2. **Multi-year handling.** If the data spans multiple calendar years, compute the metric **independently per year** — YTD resets to $0 on Jan 1 of each year. Include the `year` in the output.
3. **Ordering.** Order the results by `year`, then `week_start_date`.
## Output
Return a table with:
- `year` (INT)
- `week_start_date` (DATE)
- `num_customers_ge_1000_ytd` (INT)
(If you assume a single year, the original minimal output of `week_start` and `num_customers_ge_1000_ytd` is also acceptable — the `year` column is what generalizes it to multiple years.)
Quick Answer: Databricks data-scientist technical-screen question: given a transactions table, compute for each calendar week the number of distinct customers whose year-to-date spend reaches at least $1000, resetting YTD each January and supporting multiple years. It tests running-total window functions, per-customer-then-threshold logic, Monday-week bucketing, and distinct counts in SQL or pandas.
Solution
The core trap is that a customer who crosses $1000 stays counted for the rest of the year — so you cannot simply tally per-week new-spenders, and you must accumulate spend *per customer* before thresholding, not aggregate first and threshold a running total of the aggregate.
**Approach (two stages):**
1. **Per-customer, per-week cumulative YTD.** Bucket each transaction into its `(year, week_start)` and sum `dollars` per `(customer_id, year, week)`. Then take a running cumulative sum over weeks within each `(customer_id, year)`, partitioned by year so YTD resets every Jan 1.
2. **Threshold then count distinct.** For each `(year, week)`, count how many customers have cumulative YTD ≥ 1000 as of that week. Because the cumulative sum is monotonically non-decreasing within a year, once a customer reaches the threshold they remain counted in every later week of that year.
A subtlety: a customer only has rows for weeks in which they transacted. To count them in *intervening* weeks where they didn't spend (but their YTD is still ≥ 1000), you must carry their cumulative value forward across all weeks of the year. The cleanest way is to evaluate, for each week present in the data, every customer whose first week of reaching ≥ $1000 is ≤ that week.
**SQL (window functions):**
```sql
WITH per_week AS ( -- spend per customer per week
SELECT
customer_id,
EXTRACT(YEAR FROM date)::int AS yr,
DATE_TRUNC('week', date)::date AS week_start, -- Monday
SUM(dollars) AS week_spend
FROM transactions
GROUP BY 1, 2, 3
),
cum AS ( -- running YTD per customer within each year
SELECT
customer_id, yr, week_start,
SUM(week_spend) OVER (
PARTITION BY customer_id, yr
ORDER BY week_start
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS ytd_spend
FROM per_week
),
threshold_week AS ( -- first week each customer hits >= 1000, per year
SELECT customer_id, yr, MIN(week_start) AS hit_week
FROM cum
WHERE ytd_spend >= 1000
GROUP BY customer_id, yr
),
weeks AS ( -- every (year, week) present in the data
SELECT DISTINCT yr, week_start FROM per_week
)
SELECT
w.yr AS year,
w.week_start AS week_start_date,
COUNT(DISTINCT t.customer_id) AS num_customers_ge_1000_ytd
FROM weeks w
LEFT JOIN threshold_week t
ON t.yr = w.yr
AND t.hit_week <= w.week_start -- counted from the week they crossed onward
GROUP BY w.yr, w.week_start
ORDER BY w.yr, w.week_start;
```
The `threshold_week` CTE collapses each customer to the first week they reach $1000 in a given year; joining `hit_week <= week_start` then counts them in that week and every later week of the same year automatically — no forward-fill of empty weeks needed. `DATE_TRUNC('week', ...)` returns the Monday in Postgres/Spark SQL (Databricks); adjust the bucketing function for other engines.
**Python (pandas):**
```python
import pandas as pd
def weekly_ge_1000(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.isocalendar().year # ISO year aligns with Monday weeks
# Monday of the week containing each date:
df['week_start'] = (df['date'] - pd.to_timedelta(df['date'].dt.weekday, unit='D')).dt.normalize()
# 1) per customer, per (year, week) spend
wk = (df.groupby(['customer_id', 'year', 'week_start'], as_index=False)['dollars']
.sum()
.sort_values(['customer_id', 'year', 'week_start']))
# 2) running YTD per customer within each year
wk['ytd'] = wk.groupby(['customer_id', 'year'])['dollars'].cumsum()
# 3) first week each customer crosses 1000 (per year)
hit = (wk[wk['ytd'] >= 1000]
.groupby(['customer_id', 'year'], as_index=False)['week_start'].min()
.rename(columns={'week_start': 'hit_week'}))
# 4) for every (year, week) in the data, count customers whose hit_week <= week
weeks = wk[['year', 'week_start']].drop_duplicates()
merged = weeks.merge(hit, on='year', how='left')
counted = merged[merged['hit_week'] <= merged['week_start']]
out = (counted.groupby(['year', 'week_start'])['customer_id']
.nunique()
.reset_index(name='num_customers_ge_1000_ytd'))
# weeks where nobody has crossed yet should still appear with 0
out = (weeks.merge(out, on=['year', 'week_start'], how='left')
.fillna({'num_customers_ge_1000_ytd': 0}))
out['num_customers_ge_1000_ytd'] = out['num_customers_ge_1000_ytd'].astype(int)
return (out.rename(columns={'week_start': 'week_start_date'})
.sort_values(['year', 'week_start_date'])
.reset_index(drop=True))
```
**Points the interviewer probes:**
- **Accumulate per customer, then threshold — not the reverse.** Thresholding a running total of *aggregate* spend, or summing per-week distinct counts, both give wrong answers. The threshold is per customer, the distinct count is per week.
- **Stickiness.** Once a customer is ≥ $1000 in a year, they stay counted for all later weeks that year even with no further spend. The `hit_week <= week_start` join (or a forward-fill of the cumulative value) captures this; a naive per-week filter on rows that exist would undercount the quiet weeks.
- **YTD resets per year.** Partition the running sum by `(customer_id, year)` so Jan 1 zeroes it out; don't let December carry into January.
- **Consistent week boundary.** Pick Monday (or any boundary) and bucket every date the same way; mismatched bucketing between the cumulative and the reporting step silently corrupts counts. Using ISO year keeps the Monday-week and the year consistent at year boundaries.
- **Distinct customers.** Use `COUNT(DISTINCT customer_id)` / `nunique`; a customer with many transactions in a week must count once.
- **Complexity.** O(N log N) dominated by sorting/grouping the transactions; the window cumulative sum and the final group-by are linear in the number of `(customer, week)` rows.
Explanation
This screen tests cumulative (running-total) aggregation, per-year partitioning that resets YTD, date/week bucketing to Monday, and a distinct count over a sticky threshold. The key insight separating a correct answer from a plausible-but-wrong one: spend must be accumulated per customer and the $1000 threshold applied per customer, after which you count distinct customers per week — and a customer who crosses the threshold remains counted for every later week of that year, including weeks with no transactions. A strong candidate handles the multi-year reset, picks a consistent week boundary, and reasons about why summing per-week distinct counts or thresholding an aggregate running total is incorrect.