You own the daily_user_metrics fact table. Build an idempotent, rerunnable ETL that can be triggered for any date D and correctly handles duplicates, late-arriving records (up to 2 days late), and status changes. You must write whiteboard-level SQL for the core transformation and describe the upsert strategy.
Warehouse context (all timestamps UTC). Raw landing schemas and tiny samples:
Table: users_dim
| u_id | created_at | country | is_test |
|---|---|---|---|
| 1 | 2025-08-28 10:00:00 | US | 0 |
| 2 | 2025-08-30 12:00:00 | CA | 0 |
| 3 | 2025-08-15 09:00:00 | US | 1 |
Table: events_raw
| event_id | u_id | event_type | event_ts | ingested_at | source |
|---|---|---|---|---|---|
| e1 | 1 | view | 2025-09-01 02:03:00 | 2025-09-01 02:03:05 | web |
| e1 | 1 | view | 2025-09-01 02:03:00 | 2025-09-02 01:00:00 | replay (duplicate, late) |
| e2 | 1 | purchase | 2025-09-01 03:10:00 | 2025-09-01 03:10:04 | web |
| e3 | 2 | view | 2025-08-31 23:59:59 | 2025-09-01 00:00:01 | web (late arrival for 08-31) |
Table: orders_raw
| order_id | u_id | amount | status | order_ts | updated_at | ingested_at |
|---|---|---|---|---|---|---|
| o1 | 1 | 100.00 | paid | 2025-09-01 03:10:00 | 2025-09-01 03:12:00 | 2025-09-01 03:12:05 |
| o1 | 1 | 100.00 | refunded | 2025-09-01 03:10:00 | 2025-09-03 09:00:00 | 2025-09-03 09:00:10 (late status change) |
| o2 | 2 | 50.00 | pending | 2025-09-01 20:00:00 | 2025-09-01 20:01:00 | 2025-09-01 20:01:05 |
Target: daily_user_metrics (dt DATE, u_id BIGINT, first_event_ts TIMESTAMP, events_cnt INT, paid_orders_cnt INT, paid_orders_amt DECIMAL(12,2)). Exclude users with users_dim.is_test = 1.
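A minimal DDL sketch for the target, assuming (dt, u_id) as the logical upsert key; the prompt fixes only the column list, so the key and NOT NULL constraints here are assumptions:

```sql
-- Logical shape of the fact table; PRIMARY KEY enforcement and any
-- dt-partitioning syntax vary by engine, so both are assumptions.
CREATE TABLE daily_user_metrics (
  dt              DATE          NOT NULL,
  u_id            BIGINT        NOT NULL,
  first_event_ts  TIMESTAMP,
  events_cnt      INT           NOT NULL,
  paid_orders_cnt INT           NOT NULL,
  paid_orders_amt DECIMAL(12,2) NOT NULL,
  PRIMARY KEY (dt, u_id)        -- logical upsert key for the MERGE task below
);
```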
Tasks (be precise; the details are deliberately tricky):
- Dedup staging logic. Write SQL CTE(s) that deduplicate events_raw by event_id, keeping only the row with the max(ingested_at). Do the same for orders_raw by order_id, keeping the row with the max(updated_at) as the authoritative status snapshot. Explain why dedup by natural keys (event_id/order_id) is safer here than relying on ROW_NUMBER over (u_id, event_ts). (A sketch follows this list.)
- Daily metric for D = 2025-09-01. Using only SQL, produce one row per non-test user with: first_event_ts on D; events_cnt on D; paid_orders_cnt and paid_orders_amt on D, where an order counts only if its latest status (per your dedup) is in ('paid','shipped','completed') and not in ('refunded','canceled'). Show the SELECT that computes these fields from your deduped CTEs. Filter events by event_ts in the half-open UTC interval [D 00:00:00, D+1 00:00:00); a closed upper bound like D 23:59:59.999 silently drops timestamps with sub-millisecond precision. (A sketch follows this list.)
- Late data capture window. Assume records for D can arrive up to 2 days late. Describe and write SQL for an incremental approach in which each run at date R recomputes the partitions for [R-2, R] (so D is recomputed on R = D, D+1, and D+2) and then MERGEs each recomputed partition, one at a time, into daily_user_metrics so that late-arriving rows are included exactly once. Show an example MERGE (or INSERT OVERWRITE PARTITION) for dt = D and explain how it remains idempotent on reruns. (A sketch follows this list.)
- Guardrails and failure modes. Enumerate at least five edge cases your ETL must handle (e.g., partial writes/transactionality, schema drift adding a nullable column to events_raw, null event_ts with non-null ingested_at, daylight saving time changes if you later switch to country-local days, replayed backfills that resend old event_ids). For each, state the defensive technique (e.g., write-ahead staging + checksum, a schema evolution policy, fallback to ingested_at for partition pruning but event_ts for business logic, a surrogate partitioning strategy, MERGE with deterministic dedup). (A sketch of one guardrail follows this list.)
- Validation. Propose two reconciliation queries: one that compares counts and sums between orders_raw (latest status) and daily_user_metrics for D, and one that detects duplicate event_ids that still leaked into the D partition. Include expected results when using the sample data above. (Sketches follow this list.)
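Illustrative sketches for the tasks above:

For task 1, one possible shape of the dedup staging step, with ROW_NUMBER partitioned by the natural key; the CTE names and the secondary tie-break columns are illustrative:

```sql
WITH events_dedup AS (
  SELECT *
  FROM (
    SELECT e.*,
           ROW_NUMBER() OVER (
             PARTITION BY event_id
             ORDER BY ingested_at DESC, source  -- secondary key keeps ties deterministic
           ) AS rn
    FROM events_raw AS e
  ) ranked
  WHERE rn = 1
),
orders_dedup AS (
  SELECT *
  FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (
             PARTITION BY order_id
             ORDER BY updated_at DESC, ingested_at DESC
           ) AS rn
    FROM orders_raw AS o
  ) ranked
  WHERE rn = 1
)
-- The final SELECT only makes the sketch runnable; downstream steps
-- would consume both CTEs.
SELECT event_id, u_id, event_type, event_ts
FROM events_dedup;
```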
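For task 2, a sketch that aggregates events and orders separately before joining, so the two streams cannot fan out against each other; it assumes the events_dedup/orders_dedup outputs from the previous sketch are available (repeat them in the WITH clause or materialize them as staging tables):

```sql
WITH day_events AS (
  SELECT u_id,
         MIN(event_ts) AS first_event_ts,
         COUNT(*)      AS events_cnt
  FROM events_dedup
  WHERE event_ts >= TIMESTAMP '2025-09-01 00:00:00'
    AND event_ts <  TIMESTAMP '2025-09-02 00:00:00'   -- half-open UTC day
  GROUP BY u_id
),
day_paid_orders AS (
  SELECT u_id,
         COUNT(*)    AS paid_orders_cnt,
         SUM(amount) AS paid_orders_amt
  FROM orders_dedup
  WHERE order_ts >= TIMESTAMP '2025-09-01 00:00:00'
    AND order_ts <  TIMESTAMP '2025-09-02 00:00:00'
    AND status IN ('paid', 'shipped', 'completed')    -- latest status only, per dedup
  GROUP BY u_id
)
SELECT DATE '2025-09-01'                 AS dt,
       COALESCE(e.u_id, o.u_id)          AS u_id,
       e.first_event_ts,
       COALESCE(e.events_cnt, 0)         AS events_cnt,
       COALESCE(o.paid_orders_cnt, 0)    AS paid_orders_cnt,
       COALESCE(o.paid_orders_amt, 0.00) AS paid_orders_amt
FROM day_events AS e
FULL OUTER JOIN day_paid_orders AS o
  ON o.u_id = e.u_id
JOIN users_dim AS u
  ON u.u_id = COALESCE(e.u_id, o.u_id)
WHERE u.is_test = 0;                                  -- exclude test users
```

On the sample data this should yield a single row for u_id = 1 (first_event_ts 02:03:00, events_cnt 2, paid 0 / 0.00): o1's latest status is refunded, o2 is still pending, and e3 belongs to 2025-08-31.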
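For task 3, an ANSI-style MERGE sketch for the dt = D partition, assuming the recomputed window has already been written to a staging table daily_user_metrics_stage (a hypothetical name):

```sql
MERGE INTO daily_user_metrics AS t
USING (
  SELECT *
  FROM daily_user_metrics_stage
  WHERE dt = DATE '2025-09-01'            -- one partition per MERGE
) AS s
ON t.dt = s.dt AND t.u_id = s.u_id
WHEN MATCHED THEN UPDATE SET
  first_event_ts  = s.first_event_ts,
  events_cnt      = s.events_cnt,
  paid_orders_cnt = s.paid_orders_cnt,
  paid_orders_amt = s.paid_orders_amt
WHEN NOT MATCHED THEN INSERT
  (dt, u_id, first_event_ts, events_cnt, paid_orders_cnt, paid_orders_amt)
VALUES
  (s.dt, s.u_id, s.first_event_ts, s.events_cnt,
   s.paid_orders_cnt, s.paid_orders_amt);
```

Reruns are idempotent because matched rows are overwritten with the same recomputed values and unmatched rows are inserted exactly once. Note the caveat that a plain MERGE never deletes rows that vanish from the recomputation (e.g., a user whose only paid order was later refunded), so engines without a WHEN NOT MATCHED BY SOURCE ... DELETE clause typically use delete-then-insert or INSERT OVERWRITE on the partition instead.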
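For task 4, one of the listed guardrails made concrete: prune the raw scan on ingested_at (never null, and assumed here to be >= event_ts) while keeping event_ts as the business timestamp, with D = 2025-09-01:

```sql
-- Pruning uses ingested_at; business logic uses event_ts. With D =
-- 2025-09-01 the recompute window is [2025-08-30, 2025-09-01].
SELECT *
FROM events_raw
WHERE ingested_at >= TIMESTAMP '2025-08-30 00:00:00'  -- partition pruning only
  AND event_ts    >= TIMESTAMP '2025-08-30 00:00:00'  -- business-date filter
  AND event_ts    <  TIMESTAMP '2025-09-02 00:00:00';
-- Rows with event_ts IS NULL fall out of the business filter; route
-- them to a quarantine table for review instead of silently dropping them.
```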
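For task 5, two reconciliation sketches; events_dedup is again assumed to be the materialized staging output of the dedup step:

```sql
-- (a) Counts and sums: orders_raw at latest status vs. the fact table.
WITH latest_orders AS (
  SELECT *
  FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY order_id
                              ORDER BY updated_at DESC) AS rn
    FROM orders_raw AS o
  ) ranked
  WHERE rn = 1
),
raw_side AS (
  SELECT COUNT(*)                    AS paid_cnt,
         COALESCE(SUM(lo.amount), 0) AS paid_amt
  FROM latest_orders AS lo
  JOIN users_dim AS u
    ON u.u_id = lo.u_id AND u.is_test = 0   -- keep both sides comparable
  WHERE lo.order_ts >= TIMESTAMP '2025-09-01 00:00:00'
    AND lo.order_ts <  TIMESTAMP '2025-09-02 00:00:00'
    AND lo.status IN ('paid', 'shipped', 'completed')
),
fact_side AS (
  SELECT COALESCE(SUM(paid_orders_cnt), 0) AS paid_cnt,
         COALESCE(SUM(paid_orders_amt), 0) AS paid_amt
  FROM daily_user_metrics
  WHERE dt = DATE '2025-09-01'
)
SELECT r.paid_cnt AS raw_cnt, f.paid_cnt AS fact_cnt,
       r.paid_amt AS raw_amt, f.paid_amt AS fact_amt
FROM raw_side AS r CROSS JOIN fact_side AS f
WHERE r.paid_cnt <> f.paid_cnt OR r.paid_amt <> f.paid_amt;
-- Expected on the sample: zero rows (both sides are 0 / 0.00, because
-- o1's latest status is refunded and o2 is pending).

-- (b) Duplicate event_ids that leaked into the D partition's staged input.
SELECT event_id, COUNT(*) AS copies
FROM events_dedup
WHERE event_ts >= TIMESTAMP '2025-09-01 00:00:00'
  AND event_ts <  TIMESTAMP '2025-09-02 00:00:00'
GROUP BY event_id
HAVING COUNT(*) > 1;
-- Expected on the sample: zero rows (the replayed copy of e1 was
-- removed by the dedup step).
```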