Merge seven tables into one clean DataFrame

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during HR Screen rounds at Capital One.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Capital One during technical interviews.

Question

Using pandas only (no loops over rows), write a function build_facts(customers, orders, order_items, products, payments, shipments, refunds) -> pd.DataFrame that returns a single denormalized table with exactly these columns in this order: ["order_item_id","order_id","customer_id","order_ts","paid_ts","shipped_ts","product_id","sku","category","qty","unit_price","line_amount","refund_amount","country","status","payment_method","carrier"]. Input tables (columns): customers(customer_id:int, signup_date:datetime, country:str); orders(order_id:int, customer_id:int, order_ts:datetime, status:str); order_items(order_item_id:int, order_id:int, product_id:int, qty:int, unit_price:float); products(product_id:int, sku:str, category:str); payments(payment_id:int, order_id:int, paid_ts:datetime, method:str, amount:float); shipments(shipment_id:int, order_id:int, shipped_ts:datetime, carrier:str, ship_cost:float); refunds(refund_id:int, order_item_id:int, refund_ts:datetime, refund_amount:float, reason:str). Rules: one row per order_item_id; line_amount = qty * unit_price; refund_amount = sum of refund_amount per order_item_id (0.0 if none); paid_ts = max(paid_ts) per order; payment_method = method associated with the largest payment amount for the order (break ties by lexicographically smallest method); shipped_ts = earliest shipped_ts per order (NaT if none); left‑join so every order_item_id in order_items appears exactly once; ensure datetime dtypes for *_ts, numeric dtypes for qty/unit_price/line_amount/refund_amount; output must contain no extra or missing columns and exact lower_snake_case names. Provide brief unit‑test style assertions for row count preservation, sorted column order, and absence of duplicate order_item_id.

PracHub · Accepted Answer

This question evaluates proficiency with pandas-based data manipulation, including multi-table joins, aggregations, dtype enforcement, denormalization, and verification of row/column integrity within the Data Manipulation (SQL/Python) domain for a Data Scientist role.

Quick Overview

Quick Overview