You own a daily ETL + reconciliation job between two financial ledgers. Late postings (“delay time”) up to 48 hours are common. Schema: - payments_raw(payment_id INT, user_id INT, amount DECIMAL(10,2), currency CHAR(3), created_at TIMESTAMP) - ledger_a(txn_id STRING, user_id INT, amount_usd DECIMAL(10,2), posted_at TIMESTAMP) - ledger_b(txn_id STRING, user_id INT, amount_usd DECIMAL(10,2), posted_at TIMESTAMP) - fx_rates(rate_date DATE, currency CHAR(3), usd_rate DECIMAL(10,6)) -- USD per 1 currency unit Small ASCII samples: payments_raw payment_id | user_id | amount | currency | created_at 1 | 101 | 100.00 | USD | 2025-08-30 10:05:00 2 | 102 | 90.00 | EUR | 2025-08-30 16:40:00 3 | 101 | 50.00 | USD | 2025-08-31 23:30:00 4 | 103 | 100.00 | EUR | 2025-09-01 01:20:00 fx_rates rate_date | currency | usd_rate 2025-08-30 | USD | 1.000000 2025-08-30 | EUR | 1.100000 2025-08-31 | USD | 1.000000 2025-08-31 | EUR | 1.120000 2025-09-01 | USD | 1.000000 2025-09-01 | EUR | 1.150000 ledger_a txn_id | user_id | amount_usd | posted_at A-1 | 101 | 100.00 | 2025-08-30 10:06:00 A-2 | 102 | 99.00 | 2025-08-30 18:00:00 A-3 | 103 | 115.00 | 2025-09-01 08:00:00 ledger_b txn_id | user_id | amount_usd | posted_at B-1 | 101 | 100.00 | 2025-09-02 11:00:00 B-2 | 101 | 50.00 | 2025-09-01 00:10:00 B-3 | 102 | 99.01 | 2025-08-30 18:02:00 Tasks: 1) Write a single SQL query that produces a reconciliation report at the payment level with columns: payment_id, user_id, amount_usd_from_fx, a_txn_id, b_txn_id, match_status (one of: matched, late_within_48h, late_beyond_48h, amount_mismatch, missing_in_a, missing_in_b), late_by_hours (NULL if not late), and abs_amount_diff. Rules: (a) Convert payments_raw.amount to USD using the fx rate effective on created_at’s date (use the most recent rate on or before created_at). (b) Attempt to match each payment to at most one txn from each ledger on same user_id, |amount_usd_from_fx − amount_usd| ≤ 0.01, and |posted_at − created_at| ≤ 48 hours for a normal match; if > 48 hours but otherwise matching, classify as late_beyond_48h and compute late_by_hours; if amounts within 0.01 in only one ledger, set the other as missing_in_*. If both ledgers match but amounts differ by > 0.01, classify amount_mismatch. Break ties by the smallest absolute time difference to created_at using window functions. 2) In Python, outline an idempotent daily ETL design that: (a) partitions by event_date (created_at::date), (b) maintains a rolling 72-hour backfill to catch late arrivals, (c) writes a reconciliation snapshot table with a deterministic primary key, and (d) guarantees exactly-once downstream effects. Mention how you would detect and merge late_beyond_48h fixes without duplicating rows. 3) Generate a discrepancy distribution chart. First, produce a pie chart of counts by match_status using matplotlib (or plotly). Then justify why a bar chart with percentages and 95% CIs is superior here and sketch how you’d produce it. Name the exact Python packages you’d use. 4) Defend Python vs R for this pipeline. Be specific about libraries (e.g., pandas/pyarrow/duckdb vs data.table/arrow/dbplyr), deployment/runtime concerns, and interoperability with the rest of a modern data stack.

This question evaluates a candidate's ability to implement data reconciliation and ETL engineering skills, including SQL window functions and joins, time-based FX rate lookups, matching logic for late-arriving events, idempotent daily pipeline design with deterministic keys and exactly-once semantics, and basic statistical charting in Python.

How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

What difficulty level is this interview question?

This is a Medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Capital One.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Capital One during technical interviews.

Reconcile ledgers with SQL/Python and late events | Capital One Interview Question

You own a daily ETL + reconciliation job between two financial ledgers. Late postings (“delay time”) up to 48 hours are common. Schema:

payments_raw(payment_id INT, user_id INT, amount DECIMAL(10,2), currency CHAR(3), created_at TIMESTAMP)
ledger_a(txn_id STRING, user_id INT, amount_usd DECIMAL(10,2), posted_at TIMESTAMP)
ledger_b(txn_id STRING, user_id INT, amount_usd DECIMAL(10,2), posted_at TIMESTAMP)
fx_rates(rate_date DATE, currency CHAR(3), usd_rate DECIMAL(10,6)) -- USD per 1 currency unit Small ASCII samples: payments_raw payment_id | user_id | amount | currency | created_at 1 | 101 | 100.00 | USD | 2025-08-30 10:05:00 2 | 102 | 90.00 | EUR | 2025-08-30 16:40:00 3 | 101 | 50.00 | USD | 2025-08-31 23:30:00 4 | 103 | 100.00 | EUR | 2025-09-01 01:20:00

fx_rates rate_date | currency | usd_rate 2025-08-30 | USD | 1.000000 2025-08-30 | EUR | 1.100000 2025-08-31 | USD | 1.000000 2025-08-31 | EUR | 1.120000 2025-09-01 | USD | 1.000000 2025-09-01 | EUR | 1.150000

ledger_a txn_id | user_id | amount_usd | posted_at A-1 | 101 | 100.00 | 2025-08-30 10:06:00 A-2 | 102 | 99.00 | 2025-08-30 18:00:00 A-3 | 103 | 115.00 | 2025-09-01 08:00:00

ledger_b txn_id | user_id | amount_usd | posted_at B-1 | 101 | 100.00 | 2025-09-02 11:00:00 B-2 | 101 | 50.00 | 2025-09-01 00:10:00 B-3 | 102 | 99.01 | 2025-08-30 18:02:00 Tasks:

Write a single SQL query that produces a reconciliation report at the payment level with columns: payment_id, user_id, amount_usd_from_fx, a_txn_id, b_txn_id, match_status (one of: matched, late_within_48h, late_beyond_48h, amount_mismatch, missing_in_a, missing_in_b), late_by_hours (NULL if not late), and abs_amount_diff. Rules: (a) Convert payments_raw.amount to USD using the fx rate effective on created_at’s date (use the most recent rate on or before created_at). (b) Attempt to match each payment to at most one txn from each ledger on same user_id, |amount_usd_from_fx − amount_usd| ≤ 0.01, and |posted_at − created_at| ≤ 48 hours for a normal match; if > 48 hours but otherwise matching, classify as late_beyond_48h and compute late_by_hours; if amounts within 0.01 in only one ledger, set the other as missing_in_*. If both ledgers match but amounts differ by > 0.01, classify amount_mismatch. Break ties by the smallest absolute time difference to created_at using window functions.
In Python, outline an idempotent daily ETL design that: (a) partitions by event_date (created_at::date), (b) maintains a rolling 72-hour backfill to catch late arrivals, (c) writes a reconciliation snapshot table with a deterministic primary key, and (d) guarantees exactly-once downstream effects. Mention how you would detect and merge late_beyond_48h fixes without duplicating rows.
Generate a discrepancy distribution chart. First, produce a pie chart of counts by match_status using matplotlib (or plotly). Then justify why a bar chart with percentages and 95% CIs is superior here and sketch how you’d produce it. Name the exact Python packages you’d use.
Defend Python vs R for this pipeline. Be specific about libraries (e.g., pandas/pyarrow/duckdb vs data.table/arrow/dbplyr), deployment/runtime concerns, and interoperability with the rest of a modern data stack.

You own a daily ETL + reconciliation job between two financial ledgers. Late postings (“delay time”) up to 48 hours are common. Schema:

payments_raw(payment_id INT, user_id INT, amount DECIMAL(10,2), currency CHAR(3), created_at TIMESTAMP)
ledger_a(txn_id STRING, user_id INT, amount_usd DECIMAL(10,2), posted_at TIMESTAMP)
ledger_b(txn_id STRING, user_id INT, amount_usd DECIMAL(10,2), posted_at TIMESTAMP)
fx_rates(rate_date DATE, currency CHAR(3), usd_rate DECIMAL(10,6)) -- USD per 1 currency unit Small ASCII samples: payments_raw payment_id | user_id | amount | currency | created_at 1 | 101 | 100.00 | USD | 2025-08-30 10:05:00 2 | 102 | 90.00 | EUR | 2025-08-30 16:40:00 3 | 101 | 50.00 | USD | 2025-08-31 23:30:00 4 | 103 | 100.00 | EUR | 2025-09-01 01:20:00

ledger_a txn_id | user_id | amount_usd | posted_at A-1 | 101 | 100.00 | 2025-08-30 10:06:00 A-2 | 102 | 99.00 | 2025-08-30 18:00:00 A-3 | 103 | 115.00 | 2025-09-01 08:00:00

ledger_b txn_id | user_id | amount_usd | posted_at B-1 | 101 | 100.00 | 2025-09-02 11:00:00 B-2 | 101 | 50.00 | 2025-09-01 00:10:00 B-3 | 102 | 99.01 | 2025-08-30 18:02:00 Tasks:

Write a single SQL query that produces a reconciliation report at the payment level with columns: payment_id, user_id, amount_usd_from_fx, a_txn_id, b_txn_id, match_status (one of: matched, late_within_48h, late_beyond_48h, amount_mismatch, missing_in_a, missing_in_b), late_by_hours (NULL if not late), and abs_amount_diff. Rules: (a) Convert payments_raw.amount to USD using the fx rate effective on created_at’s date (use the most recent rate on or before created_at). (b) Attempt to match each payment to at most one txn from each ledger on same user_id, |amount_usd_from_fx − amount_usd| ≤ 0.01, and |posted_at − created_at| ≤ 48 hours for a normal match; if > 48 hours but otherwise matching, classify as late_beyond_48h and compute late_by_hours; if amounts within 0.01 in only one ledger, set the other as missing_in_*. If both ledgers match but amounts differ by > 0.01, classify amount_mismatch. Break ties by the smallest absolute time difference to created_at using window functions.
In Python, outline an idempotent daily ETL design that: (a) partitions by event_date (created_at::date), (b) maintains a rolling 72-hour backfill to catch late arrivals, (c) writes a reconciliation snapshot table with a deterministic primary key, and (d) guarantees exactly-once downstream effects. Mention how you would detect and merge late_beyond_48h fixes without duplicating rows.
Generate a discrepancy distribution chart. First, produce a pie chart of counts by match_status using matplotlib (or plotly). Then justify why a bar chart with percentages and 95% CIs is superior here and sketch how you’d produce it. Name the exact Python packages you’d use.
Defend Python vs R for this pipeline. Be specific about libraries (e.g., pandas/pyarrow/duckdb vs data.table/arrow/dbplyr), deployment/runtime concerns, and interoperability with the rest of a modern data stack.

Reconcile ledgers with SQL/Python and late events

Quick Overview

Comments (0)

Reconcile ledgers with SQL/Python and late events

Quick Overview

Comments (0)