This question evaluates data engineering competencies including idempotent ETL design, deduplication and upsert semantics, handling late-arriving updates, partitioning and incremental watermarks, load auditing, and data-quality checks, using SQL with Python orchestration.
You need to load the last 7 days of orders into a large fact table from a noisy staging feed. Assume today is 2025-09-01. Requirements: idempotent reruns, handling of late-arriving updates, and duplicate detection. Design a complete approach (DDL + SQL/Python) that:

1) dedupes staging on order_id, keeping the row with the latest updated_at;
2) MERGEs into the fact table only when the incoming row is newer;
3) detects whether a row was already loaded;
4) supports partitioning and incremental watermarks;
5) captures load-audit metrics and data-quality checks (row counts, nulls, referential integrity).

Use the schema and tiny sample data below, then write the exact SQL for the dedupe CTE and the MERGE (or equivalent upsert), plus a brief Python snippet (pseudo-code is fine) for orchestration and idempotent checkpointing.
Schema:
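A minimal DDL sketch, with column types inferred from the sample rows below; the DECIMAL precision, key constraints, and the partitioning comment are assumptions rather than givens:

```sql
-- Types inferred from the sample data; adjust for your warehouse dialect.
CREATE TABLE staging_orders (
  order_id      BIGINT,
  customer_id   BIGINT,
  order_date    DATE,
  updated_at    TIMESTAMP,
  total_amount  DECIMAL(12,2),
  source_file   VARCHAR(255),
  ingest_time   TIMESTAMP
);

CREATE TABLE fact_orders (
  order_id      BIGINT PRIMARY KEY,   -- one current row per order
  customer_id   BIGINT,
  order_date    DATE,                 -- natural daily partition key
  updated_at    TIMESTAMP,
  total_amount  DECIMAL(12,2),
  load_batch_id VARCHAR(64),
  loaded_at     TIMESTAMP
);

CREATE TABLE dim_customer (
  customer_id BIGINT PRIMARY KEY,
  signup_date DATE
);
```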
Sample rows: staging_orders (arriving 2025-09-01, covering the last 7 days)
| order_id | customer_id | order_date | updated_at | total_amount | source_file | ingest_time |
|---|---|---|---|---|---|---|
| 100 | 1 | 2025-08-26 | 2025-08-26 10:00:00 | 49.00 | orders_20250826 | 2025-09-01 00:05:00 |
| 101 | 2 | 2025-08-28 | 2025-08-28 09:00:00 | 20.00 | orders_20250828 | 2025-09-01 00:05:10 |
| 101 | 2 | 2025-08-28 | 2025-08-30 12:30:00 | 22.00 | latefix_20250830 | 2025-09-01 00:05:10 |
| 102 | 3 | 2025-08-31 | 2025-08-31 21:00:00 | 15.50 | orders_20250831 | 2025-09-01 00:05:20 |
| 103 | 9 | 2025-09-01 | 2025-09-01 00:01:00 | 105.00 | orders_20250901 | 2025-09-01 00:05:30 |
fact_orders (before load)
| order_id | customer_id | order_date | updated_at | total_amount | load_batch_id | loaded_at |
|---|---|---|---|---|---|---|
| 101 | 2 | 2025-08-28 | 2025-08-28 09:00:00 | 20.00 | batch_0828 | 2025-08-28 10:00:00 |
dim_customer
| customer_id | signup_date |
|---|---|
| 1 | 2025-05-01 |
| 2 | 2024-11-11 |
| 3 | 2025-08-01 |
Answer specifics:
a) show the exact SQL to dedupe staging (window function or aggregate);
b) show a MERGE that updates when s.updated_at > t.updated_at and inserts when not matched;
c) propose a robust watermark (e.g., max(order_date) − 1 day) to catch late data, and say where you store it;
d) show how you'd compute and store a row hash to detect prior loads;
e) list two data-quality assertions that should fail the batch, and how to roll back safely.

Minimal sketches of (a)-(e) follow for orientation; they assume a generic ANSI-ish warehouse dialect and are not the reference answer.
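For (a), one shape the dedupe could take: a sketch using ROW_NUMBER(), with ingest_time as an assumed tie-breaker when two rows share an updated_at:

```sql
WITH deduped AS (
  SELECT *
  FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
             PARTITION BY order_id
             ORDER BY updated_at DESC, ingest_time DESC  -- latest update wins
           ) AS rn
    FROM staging_orders s
  ) ranked
  WHERE rn = 1
)
SELECT * FROM deduped;
```

Against the sample data this keeps one row per order: for order 101 the 2025-08-30 12:30:00 late fix (22.00) survives and the 09:00:00 original is dropped.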
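For (b), a MERGE sketch in ANSI-style syntax (Snowflake, BigQuery, and SQL Server each differ slightly); the dedupe from (a) is inlined in the USING clause, and :batch_id is assumed to be a bind parameter supplied by the orchestrator:

```sql
MERGE INTO fact_orders t
USING (
  SELECT * FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY order_id
                              ORDER BY updated_at DESC, ingest_time DESC) AS rn
    FROM staging_orders s
  ) ranked
  WHERE rn = 1
) s
ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET
  customer_id   = s.customer_id,
  order_date    = s.order_date,
  updated_at    = s.updated_at,
  total_amount  = s.total_amount,
  load_batch_id = :batch_id,
  loaded_at     = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT
  (order_id, customer_id, order_date, updated_at,
   total_amount, load_batch_id, loaded_at)
VALUES
  (s.order_id, s.customer_id, s.order_date, s.updated_at,
   s.total_amount, :batch_id, CURRENT_TIMESTAMP);
```

On the sample data this updates order 101 to 22.00 (the staging row is newer than the 2025-08-28 09:00:00 fact row) and inserts 100, 102, and 103; rerunning it is a no-op because s.updated_at is no longer strictly greater.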
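For (c), one place to keep the watermark is a small metadata table (the etl_watermarks name is an assumption); setting it to max(order_date) − 1 day gives each run a day of overlap, and the MERGE above keeps that overlap idempotent. Interval arithmetic syntax varies by dialect:

```sql
CREATE TABLE etl_watermarks (
  feed_name   VARCHAR(64) PRIMARY KEY,
  watermark   DATE,
  updated_at  TIMESTAMP
);

-- Advance only after a successful, DQ-clean load.
UPDATE etl_watermarks
SET watermark  = (SELECT MAX(order_date) FROM fact_orders) - INTERVAL '1' DAY,
    updated_at = CURRENT_TIMESTAMP
WHERE feed_name = 'orders';
```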
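For (d), a row hash over the business columns detects rows already loaded with identical content; MD5 and the explicit casts are assumptions (dialects differ), and NULLable columns should be COALESCEd to a sentinel first so NULLs cannot collide:

```sql
SELECT order_id,
       MD5(CONCAT_WS('|',
           CAST(order_id     AS VARCHAR),
           CAST(customer_id  AS VARCHAR),
           CAST(order_date   AS VARCHAR),
           CAST(updated_at   AS VARCHAR),
           CAST(total_amount AS VARCHAR))) AS row_hash
FROM staging_orders;
```

Storing row_hash on fact_orders (or in a load-audit table) lets the loader skip rows whose hash already exists, i.e. content that a prior batch already loaded.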
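For (e), two assertions that should fail the batch, written as counts that must come back 0; the rollback strategy is to run dedupe, MERGE, and checks in one transaction and roll back on any nonzero count:

```sql
-- 1) Referential integrity: every staged order must have a known customer.
--    Row 103 (customer_id 9, absent from dim_customer) trips this check.
SELECT COUNT(*) AS orphan_rows
FROM staging_orders s
LEFT JOIN dim_customer c ON c.customer_id = s.customer_id
WHERE c.customer_id IS NULL;

-- 2) No NULL keys and no negative amounts.
SELECT COUNT(*) AS bad_rows
FROM staging_orders
WHERE order_id IS NULL OR updated_at IS NULL OR total_amount < 0;
```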
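Finally, a minimal Python sketch of the orchestration and idempotent checkpointing the question asks for. The in-memory dicts and the run_sql stub are illustrative stand-ins; in practice the checkpoint and watermark live in warehouse tables and run_sql is your client library:

```python
from datetime import date, timedelta

# Illustrative stand-ins; in practice these are small metadata tables.
_checkpoints = {}                 # batch_id -> status  (etl_checkpoints)
_state = {"watermark": None}      # last good order_date (etl_watermarks)

def run_sql(sql, params=None):
    """Stub for a warehouse client call; returns rows (e.g. DQ failures)."""
    print("executing:", sql.splitlines()[0], params or {})
    return []                     # pretend every check passes

def run_load(run_date: date) -> None:
    batch_id = f"orders_{run_date:%Y%m%d}"     # deterministic -> idempotent
    if _checkpoints.get(batch_id) == "done":   # rerun of a finished batch
        return                                 # is a safe no-op
    watermark = _state["watermark"] or run_date - timedelta(days=7)
    since = watermark - timedelta(days=1)      # 1-day overlap for late rows
    run_sql("BEGIN")
    run_sql("-- dedupe + MERGE (see SQL sketches above)",
            {"batch_id": batch_id, "since": since})
    failures = run_sql("-- DQ assertions", {"batch_id": batch_id})
    if failures:
        run_sql("ROLLBACK")                    # fact table left untouched
        raise RuntimeError(f"DQ failed for {batch_id}: {failures}")
    run_sql("COMMIT")
    _state["watermark"] = run_date             # advance only on success
    _checkpoints[batch_id] = "done"            # checkpoint makes reruns no-ops

run_load(date(2025, 9, 1))
```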