You own a daily ETL + reconciliation job between two financial ledgers. Late postings (“delay time”) up to 48 hours are common.
Schema:
-
payments_raw(payment_id INT, user_id INT, amount DECIMAL(10,2), currency CHAR(3), created_at TIMESTAMP)
-
ledger_a(txn_id STRING, user_id INT, amount_usd DECIMAL(10,2), posted_at TIMESTAMP)
-
ledger_b(txn_id STRING, user_id INT, amount_usd DECIMAL(10,2), posted_at TIMESTAMP)
-
fx_rates(rate_date DATE, currency CHAR(3), usd_rate DECIMAL(10,6)) -- USD per 1 currency unit
Small ASCII samples:
payments_raw
payment_id | user_id | amount | currency | created_at
1 | 101 | 100.00 | USD | 2025-08-30 10:05:00
2 | 102 | 90.00 | EUR | 2025-08-30 16:40:00
3 | 101 | 50.00 | USD | 2025-08-31 23:30:00
4 | 103 | 100.00 | EUR | 2025-09-01 01:20:00
fx_rates
rate_date | currency | usd_rate
2025-08-30 | USD | 1.000000
2025-08-30 | EUR | 1.100000
2025-08-31 | USD | 1.000000
2025-08-31 | EUR | 1.120000
2025-09-01 | USD | 1.000000
2025-09-01 | EUR | 1.150000
ledger_a
txn_id | user_id | amount_usd | posted_at
A-1 | 101 | 100.00 | 2025-08-30 10:06:00
A-2 | 102 | 99.00 | 2025-08-30 18:00:00
A-3 | 103 | 115.00 | 2025-09-01 08:00:00
ledger_b
txn_id | user_id | amount_usd | posted_at
B-1 | 101 | 100.00 | 2025-09-02 11:00:00
B-2 | 101 | 50.00 | 2025-09-01 00:10:00
B-3 | 102 | 99.01 | 2025-08-30 18:02:00
Tasks:
-
Write a single SQL query that produces a reconciliation report at the payment level with columns: payment_id, user_id, amount_usd_from_fx, a_txn_id, b_txn_id, match_status (one of: matched, late_within_48h, late_beyond_48h, amount_mismatch, missing_in_a, missing_in_b), late_by_hours (NULL if not late), and abs_amount_diff. Rules: (a) Convert payments_raw.amount to USD using the fx rate effective on created_at’s date (use the most recent rate on or before created_at). (b) Attempt to match each payment to at most one txn from each ledger on same user_id, |amount_usd_from_fx − amount_usd| ≤ 0.01, and |posted_at − created_at| ≤ 48 hours for a normal match; if > 48 hours but otherwise matching, classify as late_beyond_48h and compute late_by_hours; if amounts within 0.01 in only one ledger, set the other as missing_in_*. If both ledgers match but amounts differ by > 0.01, classify amount_mismatch. Break ties by the smallest absolute time difference to created_at using window functions.
-
In Python, outline an idempotent daily ETL design that: (a) partitions by event_date (created_at::date), (b) maintains a rolling 72-hour backfill to catch late arrivals, (c) writes a reconciliation snapshot table with a deterministic primary key, and (d) guarantees exactly-once downstream effects. Mention how you would detect and merge late_beyond_48h fixes without duplicating rows.
-
Generate a discrepancy distribution chart. First, produce a pie chart of counts by match_status using matplotlib (or plotly). Then justify why a bar chart with percentages and 95% CIs is superior here and sketch how you’d produce it. Name the exact Python packages you’d use.
-
Defend Python vs R for this pipeline. Be specific about libraries (e.g., pandas/pyarrow/duckdb vs data.table/arrow/dbplyr), deployment/runtime concerns, and interoperability with the rest of a modern data stack.