PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches

Quick Overview

This question evaluates proficiency with pandas-based data manipulation, including multi-table joins, aggregations, dtype enforcement, denormalization, and verification of row/column integrity within the Data Manipulation (SQL/Python) domain for a Data Scientist role.

  • Medium
  • Capital One
  • Data Manipulation (SQL/Python)
  • Data Scientist

Merge seven tables into one clean DataFrame

Company: Capital One

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: HR Screen

Using pandas only (no loops over rows), write a function build_facts(customers, orders, order_items, products, payments, shipments, refunds) -> pd.DataFrame that returns a single denormalized table with exactly these columns in this order: ["order_item_id","order_id","customer_id","order_ts","paid_ts","shipped_ts","product_id","sku","category","qty","unit_price","line_amount","refund_amount","country","status","payment_method","carrier"]. Input tables (columns): customers(customer_id:int, signup_date:datetime, country:str); orders(order_id:int, customer_id:int, order_ts:datetime, status:str); order_items(order_item_id:int, order_id:int, product_id:int, qty:int, unit_price:float); products(product_id:int, sku:str, category:str); payments(payment_id:int, order_id:int, paid_ts:datetime, method:str, amount:float); shipments(shipment_id:int, order_id:int, shipped_ts:datetime, carrier:str, ship_cost:float); refunds(refund_id:int, order_item_id:int, refund_ts:datetime, refund_amount:float, reason:str). Rules: one row per order_item_id; line_amount = qty * unit_price; refund_amount = sum of refund_amount per order_item_id (0.0 if none); paid_ts = max(paid_ts) per order; payment_method = method associated with the largest payment amount for the order (break ties by lexicographically smallest method); shipped_ts = earliest shipped_ts per order (NaT if none); left‑join so every order_item_id in order_items appears exactly once; ensure datetime dtypes for *_ts, numeric dtypes for qty/unit_price/line_amount/refund_amount; output must contain no extra or missing columns and exact lower_snake_case names. Provide brief unit‑test style assertions for row count preservation, sorted column order, and absence of duplicate order_item_id.

Quick Answer: This question evaluates proficiency with pandas-based data manipulation, including multi-table joins, aggregations, dtype enforcement, denormalization, and verification of row/column integrity within the Data Manipulation (SQL/Python) domain for a Data Scientist role.

Last updated: Mar 29, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Clean and Merge Housing Data - Capital One (easy)
  • Find Lowest Prices for Highly Rated Categories - Capital One (medium)
  • Write SQL to compute campaign net revenue - Capital One (Medium)
  • Merge CSVs and build revenue pivot with pandas - Capital One (Medium)
  • Find top category per region in Aug 2025 - Capital One (Medium)