PracHub

Design idempotent daily loads with deduping

Last updated: Mar 29, 2026

Quick Overview

This question evaluates data engineering competencies including idempotent ETL design, deduplication and upsert semantics, handling late-arriving updates, partitioning and incremental watermarks, and load auditing with data-quality checks, using SQL plus Python orchestration.

  • Medium
  • Amazon
  • Data Manipulation (SQL/Python)
  • Data Scientist

Design idempotent daily loads with deduping

Company: Amazon

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Technical Screen



Related Interview Questions

  • Find recommended friend pairs by shared songs - Amazon (Medium)
  • Find recommended friend pairs by shared listening - Amazon (Easy)
  • Write SQL window functions for D7 retention - Amazon (Medium)
  • Find daily first-order merchants with SQL - Amazon (Medium)
  • Design student–course data models and SQL - Amazon (Medium)
Posted: Oct 13, 2025, 9:49 PM

You need to load the last 7 days of orders into a large fact table from a noisy staging feed. Assume today is 2025-09-01. Requirements: idempotent reruns, late-arriving updates, and duplicate detection. Design a complete approach (DDL + SQL/Python) that:

  1. dedupes staging by (order_id, updated_at), keeping the latest;
  2. MERGEs into the fact table only when the incoming row is newer;
  3. detects whether a row was already loaded;
  4. supports partitioning and incremental watermarks;
  5. captures load audit metrics and data-quality checks (row counts, nulls, referential integrity).

Use the schema and tiny sample data below, then write the exact SQL for the dedupe CTE and the MERGE (or equivalent upsert), plus a brief Python snippet (pseudo-code OK) for the orchestration and idempotent checkpointing.

Schema:

  • staging_orders(order_id INT, customer_id INT, order_date DATE, updated_at TIMESTAMP, total_amount DECIMAL(10,2), source_file STRING, ingest_time TIMESTAMP)
  • fact_orders(order_id INT PRIMARY KEY, customer_id INT, order_date DATE, updated_at TIMESTAMP, total_amount DECIMAL(10,2), load_batch_id STRING, loaded_at TIMESTAMP)
  • dim_customer(customer_id INT PRIMARY KEY, signup_date DATE)
  • loads_audit(batch_id STRING PRIMARY KEY, run_start TIMESTAMP, run_end TIMESTAMP, watermark_from DATE, watermark_to DATE, staged_rows INT, deduped_rows INT, inserted_rows INT, updated_rows INT, dq_errors INT)
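Before the SQL answers, it can help to stand the schema up locally. A minimal sketch using Python's built-in sqlite3; the type mappings are assumptions (STRING → TEXT, DECIMAL(10,2) → REAL, TIMESTAMP → TEXT), and a real warehouse would additionally partition fact_orders by order_date:

```python
import sqlite3

# In-memory sqlite3 approximation of the schema above, for local
# experimentation only. Warehouse-specific features (partitioning,
# clustering) have no sqlite equivalent and are omitted.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging_orders (
    order_id INTEGER, customer_id INTEGER, order_date TEXT,
    updated_at TEXT, total_amount REAL, source_file TEXT, ingest_time TEXT
);
CREATE TABLE fact_orders (
    order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT,
    updated_at TEXT, total_amount REAL, load_batch_id TEXT, loaded_at TEXT
);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, signup_date TEXT);
CREATE TABLE loads_audit (
    batch_id TEXT PRIMARY KEY, run_start TEXT, run_end TEXT,
    watermark_from TEXT, watermark_to TEXT, staged_rows INTEGER,
    deduped_rows INTEGER, inserted_rows INTEGER, updated_rows INTEGER,
    dq_errors INTEGER
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```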

Sample rows: staging_orders (arriving 2025-09-01 for the last 7 days)

| order_id | customer_id | order_date | updated_at          | total_amount | source_file      | ingest_time         |
|---------:|------------:|------------|---------------------|-------------:|------------------|---------------------|
|      100 |           1 | 2025-08-26 | 2025-08-26 10:00:00 |        49.00 | orders_20250826  | 2025-09-01 00:05:00 |
|      101 |           2 | 2025-08-28 | 2025-08-28 09:00:00 |        20.00 | orders_20250828  | 2025-09-01 00:05:10 |
|      101 |           2 | 2025-08-28 | 2025-08-30 12:30:00 |        22.00 | latefix_20250830 | 2025-09-01 00:05:10 |
|      102 |           3 | 2025-08-31 | 2025-08-31 21:00:00 |        15.50 | orders_20250831  | 2025-09-01 00:05:20 |
|      103 |           9 | 2025-09-01 | 2025-09-01 00:01:00 |       105.00 | orders_20250901  | 2025-09-01 00:05:30 |
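For part (a), one common dedupe is a ROW_NUMBER window partitioned by order_id, newest updated_at first. A runnable sketch against the sample rows using sqlite3 (window functions require SQLite 3.25+); the ingest_time tie-break is an assumption added to keep reruns deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE staging_orders (
    order_id INTEGER, customer_id INTEGER, order_date TEXT,
    updated_at TEXT, total_amount REAL, source_file TEXT, ingest_time TEXT)""")
rows = [
    (100, 1, "2025-08-26", "2025-08-26 10:00:00",  49.00, "orders_20250826",  "2025-09-01 00:05:00"),
    (101, 2, "2025-08-28", "2025-08-28 09:00:00",  20.00, "orders_20250828",  "2025-09-01 00:05:10"),
    (101, 2, "2025-08-28", "2025-08-30 12:30:00",  22.00, "latefix_20250830", "2025-09-01 00:05:10"),
    (102, 3, "2025-08-31", "2025-08-31 21:00:00",  15.50, "orders_20250831",  "2025-09-01 00:05:20"),
    (103, 9, "2025-09-01", "2025-09-01 00:01:00", 105.00, "orders_20250901",  "2025-09-01 00:05:30"),
]
conn.executemany("INSERT INTO staging_orders VALUES (?,?,?,?,?,?,?)", rows)

# Dedupe CTE: keep the newest updated_at per order_id, breaking ties on
# ingest_time so reruns over the same feed produce the same survivor.
deduped = conn.execute("""
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY updated_at DESC, ingest_time DESC) AS rn
    FROM staging_orders
)
SELECT order_id, total_amount FROM ranked WHERE rn = 1 ORDER BY order_id
""").fetchall()
print(deduped)  # order 101 keeps the 22.00 late fix, not the stale 20.00
```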

fact_orders (before load)

| order_id | customer_id | order_date | updated_at          | total_amount | load_batch_id | loaded_at           |
|---------:|------------:|------------|---------------------|-------------:|---------------|---------------------|
|      101 |           2 | 2025-08-28 | 2025-08-28 09:00:00 |        20.00 | batch_0828    | 2025-08-28 10:00:00 |
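For part (b), a warehouse with MERGE would write `WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE ... WHEN NOT MATCHED THEN INSERT`. SQLite has no MERGE, so this sketch uses the equivalent `INSERT ... ON CONFLICT DO UPDATE ... WHERE` (SQLite 3.24+); running the load twice demonstrates that the only-if-newer guard makes reruns idempotent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fact_orders (
    order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT,
    updated_at TEXT, total_amount REAL, load_batch_id TEXT, loaded_at TEXT)""")
# fact_orders before the load: the stale 20.00 version of order 101.
conn.execute("""INSERT INTO fact_orders VALUES
    (101, 2, '2025-08-28', '2025-08-28 09:00:00', 20.00,
     'batch_0828', '2025-08-28 10:00:00')""")

UPSERT = """
INSERT INTO fact_orders
    (order_id, customer_id, order_date, updated_at, total_amount,
     load_batch_id, loaded_at)
VALUES (?, ?, ?, ?, ?, ?, datetime('now'))
ON CONFLICT(order_id) DO UPDATE SET
    customer_id   = excluded.customer_id,
    order_date    = excluded.order_date,
    updated_at    = excluded.updated_at,
    total_amount  = excluded.total_amount,
    load_batch_id = excluded.load_batch_id,
    loaded_at     = excluded.loaded_at
WHERE excluded.updated_at > fact_orders.updated_at  -- update only if newer
"""
deduped = [  # output of the dedupe step, one row per order_id
    (100, 1, "2025-08-26", "2025-08-26 10:00:00",  49.00, "batch_0901"),
    (101, 2, "2025-08-28", "2025-08-30 12:30:00",  22.00, "batch_0901"),
    (102, 3, "2025-08-31", "2025-08-31 21:00:00",  15.50, "batch_0901"),
    (103, 9, "2025-09-01", "2025-09-01 00:01:00", 105.00, "batch_0901"),
]
for _ in range(2):  # second pass is a no-op: nothing is strictly newer
    conn.executemany(UPSERT, deduped)
final = conn.execute(
    "SELECT order_id, total_amount FROM fact_orders ORDER BY order_id").fetchall()
print(final)
```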

dim_customer

| customer_id | signup_date |
|------------:|-------------|
|           1 | 2025-05-01  |
|           2 | 2024-11-11  |
|           3 | 2025-08-01  |
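For part (e), one assertion that should fail this sample batch is referential integrity: order 103 references customer 9, who is missing from dim_customer above. A minimal anti-join sketch; writing the orphan count into loads_audit.dq_errors before rolling back the transaction is an assumed convention, not prescribed by the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging_orders (order_id INTEGER, customer_id INTEGER);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY);
INSERT INTO staging_orders VALUES (100,1),(101,2),(102,3),(103,9);
INSERT INTO dim_customer VALUES (1),(2),(3);
""")
# Referential-integrity assertion: every staged customer_id must exist
# in dim_customer. Any hit here should fail the batch before the MERGE.
orphans = conn.execute("""
SELECT s.order_id, s.customer_id
FROM staging_orders s
LEFT JOIN dim_customer d USING (customer_id)
WHERE d.customer_id IS NULL
""").fetchall()
print(orphans)  # order 103's customer 9 trips the check
```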

Answer specifics:

  a. Show the exact SQL to dedupe staging (window function or aggregate).
  b. Show a MERGE that updates when s.updated_at > t.updated_at and inserts when not matched.
  c. Propose a robust watermark (e.g., max(order_date) − 1 day) to catch late data, and say where you store it.
  d. Show how you'd compute and store a row hash to detect prior loads.
  e. List two data-quality assertions that should fail the batch, and explain how to roll back safely.
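For parts (c) and (d) plus the orchestration, a pure-Python sketch under stated assumptions: the function names are illustrative (no specific orchestrator), the content hash uses SHA-256 over pipe-joined fields, the watermark re-reads from one day before the previous high-water mark, and the completed-batch set stands in for the loads_audit table:

```python
import hashlib
from datetime import date, timedelta

def row_hash(order_id, customer_id, order_date, updated_at, total_amount):
    # Deterministic content hash; storing it on each fact_orders row lets
    # a rerun detect "already loaded with identical content" cheaply.
    key = f"{order_id}|{customer_id}|{order_date}|{updated_at}|{total_amount:.2f}"
    return hashlib.sha256(key.encode()).hexdigest()

def next_window(prev_max_order_date, today):
    # Watermark: re-read from one day BEFORE the previous high-water mark
    # so a late-arriving update to an already-loaded day is still caught.
    return prev_max_order_date - timedelta(days=1), today

def run_batch(batch_id, completed_batches, load_fn):
    # Idempotent checkpoint: a batch_id that already finished is a no-op,
    # so re-triggering the job cannot double-load the window.
    if batch_id in completed_batches:
        return "skipped"
    load_fn()
    completed_batches.add(batch_id)
    return "loaded"

completed = set()  # in production: persisted rows in loads_audit
wm_from, wm_to = next_window(date(2025, 8, 31), date(2025, 9, 1))
first = run_batch("batch_0901", completed, lambda: None)
second = run_batch("batch_0901", completed, lambda: None)
print(wm_from, wm_to, first, second)
```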
