You own the daily_user_metrics fact table. Build an idempotent, rerunnable ETL that can be triggered for any date D and correctly handles duplicates, late-arriving records (up to 2 days late), and status changes. You must write whiteboard-level SQL for the core transformation and describe the upsert strategy.
Warehouse context (all timestamps UTC). Raw landing schemas and tiny samples:
Table: users_dim
| u_id | created_at | country | is_test |
|---|---|---|---|
| 1 | 2025-08-28 10:00:00 | US | 0 |
| 2 | 2025-08-30 12:00:00 | CA | 0 |
| 3 | 2025-08-15 09:00:00 | US | 1 |
Table: events_raw
| event_id | u_id | event_type | event_ts | ingested_at | source |
|---|---|---|---|---|---|
| e1 | 1 | view | 2025-09-01 02:03:00 | 2025-09-01 02:03:05 | web |
| e1 | 1 | view | 2025-09-01 02:03:00 | 2025-09-02 01:00:00 | replay (duplicate, late) |
| e2 | 1 | purchase | 2025-09-01 03:10:00 | 2025-09-01 03:10:04 | web |
| e3 | 2 | view | 2025-08-31 23:59:59 | 2025-09-01 00:00:01 | web (late arrival for 08-31) |
Table: orders_raw
| order_id | u_id | amount | status | order_ts | updated_at | ingested_at |
|---|---|---|---|---|---|---|
| o1 | 1 | 100.00 | paid | 2025-09-01 03:10:00 | 2025-09-01 03:12:00 | 2025-09-01 03:12:05 |
| o1 | 1 | 100.00 | refunded | 2025-09-01 03:10:00 | 2025-09-03 09:00:00 | 2025-09-03 09:00:10 (late status change) |
| o2 | 2 | 50.00 | pending | 2025-09-01 20:00:00 | 2025-09-01 20:01:00 | 2025-09-01 20:01:05 |
Target: daily_user_metrics (dt DATE, u_id BIGINT, first_event_ts TIMESTAMP, events_cnt INT, paid_orders_cnt INT, paid_orders_amt DECIMAL(12,2)). Exclude users with users_dim.is_test = 1.
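A minimal DDL sketch for the target, assuming (dt, u_id) as the logical upsert key; the prompt fixes only the column list, so the key and NOT NULL constraints here are assumptions:

```sql
-- Logical shape of the fact table; PRIMARY KEY enforcement and any
-- dt-partitioning syntax vary by engine, so both are assumptions.
CREATE TABLE daily_user_metrics (
  dt              DATE          NOT NULL,
  u_id            BIGINT        NOT NULL,
  first_event_ts  TIMESTAMP,
  events_cnt      INT           NOT NULL,
  paid_orders_cnt INT           NOT NULL,
  paid_orders_amt DECIMAL(12,2) NOT NULL,
  PRIMARY KEY (dt, u_id)        -- logical upsert key for the MERGE task below
);
```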
Tasks (be precise; the details are deliberately tricky):
- Dedup staging logic. Write SQL CTE(s) that deduplicate events_raw by event_id, keeping only the row with the max(ingested_at). Do the same for orders_raw by order_id, keeping the row with the max(updated_at) as the authoritative status snapshot. Explain why dedup by natural keys (event_id/order_id) is safer here than relying on ROW_NUMBER over (u_id, event_ts). (A sketch follows this list.)
- Daily metric for D = 2025-09-01. Using only SQL, produce one row per non-test user with: first_event_ts on D; events_cnt on D; paid_orders_cnt and paid_orders_amt on D, where an order counts only if its latest status (per your dedup) is in ('paid','shipped','completed') and not in ('refunded','canceled'). Show the SELECT that computes these fields from your deduped CTEs. Filter events by event_ts in the half-open UTC interval [D 00:00:00, D+1 00:00:00); a closed upper bound like D 23:59:59.999 silently drops timestamps with sub-millisecond precision. (A sketch follows this list.)
- Late data capture window. Assume records for D can arrive up to 2 days late. Describe and write SQL for an incremental approach in which each run at date R recomputes the partitions for [R-2, R] (so D is recomputed on R = D, D+1, and D+2) and then MERGEs each recomputed partition, one at a time, into daily_user_metrics so that late-arriving rows are included exactly once. Show an example MERGE (or INSERT OVERWRITE PARTITION) for dt = D and explain how it remains idempotent on reruns. (A sketch follows this list.)
- Guardrails and failure modes. Enumerate at least five edge cases your ETL must handle (e.g., partial writes/transactionality, schema drift adding a nullable column to events_raw, null event_ts with non-null ingested_at, daylight saving time changes if you later switch to country-local days, replayed backfills that resend old event_ids). For each, state the defensive technique (e.g., write-ahead staging + checksum, a schema evolution policy, fallback to ingested_at for partition pruning but event_ts for business logic, a surrogate partitioning strategy, MERGE with deterministic dedup). (A sketch of one guardrail follows this list.)
- Validation. Propose two reconciliation queries: one that compares counts and sums between orders_raw (latest status) and daily_user_metrics for D, and one that detects duplicate event_ids that still leaked into the D partition. Include expected results when using the sample data above. (Sketches follow this list.)
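Illustrative sketches for the tasks above:

For task 1, one possible shape of the dedup staging step, with ROW_NUMBER partitioned by the natural key; the CTE names and the secondary tie-break columns are illustrative:

```sql
WITH events_dedup AS (
  SELECT *
  FROM (
    SELECT e.*,
           ROW_NUMBER() OVER (
             PARTITION BY event_id
             ORDER BY ingested_at DESC, source  -- secondary key keeps ties deterministic
           ) AS rn
    FROM events_raw AS e
  ) ranked
  WHERE rn = 1
),
orders_dedup AS (
  SELECT *
  FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (
             PARTITION BY order_id
             ORDER BY updated_at DESC, ingested_at DESC
           ) AS rn
    FROM orders_raw AS o
  ) ranked
  WHERE rn = 1
)
-- The final SELECT only makes the sketch runnable; downstream steps
-- would consume both CTEs.
SELECT event_id, u_id, event_type, event_ts
FROM events_dedup;
```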
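For task 2, a sketch that aggregates events and orders separately before joining, so the two streams cannot fan out against each other; it assumes the events_dedup/orders_dedup outputs from the previous sketch are available (repeat them in the WITH clause or materialize them as staging tables):

```sql
WITH day_events AS (
  SELECT u_id,
         MIN(event_ts) AS first_event_ts,
         COUNT(*)      AS events_cnt
  FROM events_dedup
  WHERE event_ts >= TIMESTAMP '2025-09-01 00:00:00'
    AND event_ts <  TIMESTAMP '2025-09-02 00:00:00'   -- half-open UTC day
  GROUP BY u_id
),
day_paid_orders AS (
  SELECT u_id,
         COUNT(*)    AS paid_orders_cnt,
         SUM(amount) AS paid_orders_amt
  FROM orders_dedup
  WHERE order_ts >= TIMESTAMP '2025-09-01 00:00:00'
    AND order_ts <  TIMESTAMP '2025-09-02 00:00:00'
    AND status IN ('paid', 'shipped', 'completed')    -- latest status only, per dedup
  GROUP BY u_id
)
SELECT DATE '2025-09-01'                 AS dt,
       COALESCE(e.u_id, o.u_id)          AS u_id,
       e.first_event_ts,
       COALESCE(e.events_cnt, 0)         AS events_cnt,
       COALESCE(o.paid_orders_cnt, 0)    AS paid_orders_cnt,
       COALESCE(o.paid_orders_amt, 0.00) AS paid_orders_amt
FROM day_events AS e
FULL OUTER JOIN day_paid_orders AS o
  ON o.u_id = e.u_id
JOIN users_dim AS u
  ON u.u_id = COALESCE(e.u_id, o.u_id)
WHERE u.is_test = 0;                                  -- exclude test users
```

On the sample data this should yield a single row for u_id = 1 (first_event_ts 02:03:00, events_cnt 2, paid 0 / 0.00): o1's latest status is refunded, o2 is still pending, and e3 belongs to 2025-08-31.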
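For task 3, an ANSI-style MERGE sketch for the dt = D partition, assuming the recomputed window has already been written to a staging table daily_user_metrics_stage (a hypothetical name):

```sql
MERGE INTO daily_user_metrics AS t
USING (
  SELECT *
  FROM daily_user_metrics_stage
  WHERE dt = DATE '2025-09-01'            -- one partition per MERGE
) AS s
ON t.dt = s.dt AND t.u_id = s.u_id
WHEN MATCHED THEN UPDATE SET
  first_event_ts  = s.first_event_ts,
  events_cnt      = s.events_cnt,
  paid_orders_cnt = s.paid_orders_cnt,
  paid_orders_amt = s.paid_orders_amt
WHEN NOT MATCHED THEN INSERT
  (dt, u_id, first_event_ts, events_cnt, paid_orders_cnt, paid_orders_amt)
VALUES
  (s.dt, s.u_id, s.first_event_ts, s.events_cnt,
   s.paid_orders_cnt, s.paid_orders_amt);
```

Reruns are idempotent because matched rows are overwritten with the same recomputed values and unmatched rows are inserted exactly once. Note the caveat that a plain MERGE never deletes rows that vanish from the recomputation (e.g., a user whose only paid order was later refunded), so engines without a WHEN NOT MATCHED BY SOURCE ... DELETE clause typically use delete-then-insert or INSERT OVERWRITE on the partition instead.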
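For task 4, one of the listed guardrails made concrete: prune the raw scan on ingested_at (never null, and assumed here to be >= event_ts) while keeping event_ts as the business timestamp, with D = 2025-09-01:

```sql
-- Pruning uses ingested_at; business logic uses event_ts. With D =
-- 2025-09-01 the recompute window is [2025-08-30, 2025-09-01].
SELECT *
FROM events_raw
WHERE ingested_at >= TIMESTAMP '2025-08-30 00:00:00'  -- partition pruning only
  AND event_ts    >= TIMESTAMP '2025-08-30 00:00:00'  -- business-date filter
  AND event_ts    <  TIMESTAMP '2025-09-02 00:00:00';
-- Rows with event_ts IS NULL fall out of the business filter; route
-- them to a quarantine table for review instead of silently dropping them.
```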
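For task 5, two reconciliation sketches; events_dedup is again assumed to be the materialized staging output of the dedup step:

```sql
-- (a) Counts and sums: orders_raw at latest status vs. the fact table.
WITH latest_orders AS (
  SELECT *
  FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY order_id
                              ORDER BY updated_at DESC) AS rn
    FROM orders_raw AS o
  ) ranked
  WHERE rn = 1
),
raw_side AS (
  SELECT COUNT(*)                    AS paid_cnt,
         COALESCE(SUM(lo.amount), 0) AS paid_amt
  FROM latest_orders AS lo
  JOIN users_dim AS u
    ON u.u_id = lo.u_id AND u.is_test = 0   -- keep both sides comparable
  WHERE lo.order_ts >= TIMESTAMP '2025-09-01 00:00:00'
    AND lo.order_ts <  TIMESTAMP '2025-09-02 00:00:00'
    AND lo.status IN ('paid', 'shipped', 'completed')
),
fact_side AS (
  SELECT COALESCE(SUM(paid_orders_cnt), 0) AS paid_cnt,
         COALESCE(SUM(paid_orders_amt), 0) AS paid_amt
  FROM daily_user_metrics
  WHERE dt = DATE '2025-09-01'
)
SELECT r.paid_cnt AS raw_cnt, f.paid_cnt AS fact_cnt,
       r.paid_amt AS raw_amt, f.paid_amt AS fact_amt
FROM raw_side AS r CROSS JOIN fact_side AS f
WHERE r.paid_cnt <> f.paid_cnt OR r.paid_amt <> f.paid_amt;
-- Expected on the sample: zero rows (both sides are 0 / 0.00, because
-- o1's latest status is refunded and o2 is pending).

-- (b) Duplicate event_ids that leaked into the D partition's staged input.
SELECT event_id, COUNT(*) AS copies
FROM events_dedup
WHERE event_ts >= TIMESTAMP '2025-09-01 00:00:00'
  AND event_ts <  TIMESTAMP '2025-09-02 00:00:00'
GROUP BY event_id
HAVING COUNT(*) > 1;
-- Expected on the sample: zero rows (the replayed copy of e1 was
-- removed by the dedup step).
```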