Unify 7 tables and impute missing values

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a Medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Take-home Project rounds at Boston Consulting Group.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Boston Consulting Group during technical interviews.

Question

Using pandas, write a robust function unify_orders(...) that ingests seven dataframes (or CSVs) with possibly inconsistent column casing/whitespace and returns a single denormalized OrdersAnalytics table with exact columns and order: [order_id, order_date, customer_id, customer_name, shipper_name, total_amount, product_count, category_list, payment_status]. Rules: - order_date must be a YYYY-MM-DD string; - total_amount is sum(quantity*unit_price) across items per order; - product_count is count of distinct product_id per order; - category_list is ';'-joined, deduplicated, alphabetically sorted category_name per order; - Keep orders even if shipper or payment is missing (shipper_name may be null; payment_status becomes 'unknown'); - No extra/missing columns; assert returned_df.columns == [...]. Handle missing values: unit_price imputed by product-level median; if unavailable, use category-level median; if still missing, use global median across order_items. quantity missing -> impute 1. Payment amount missing -> recompute from items; payment_status missing -> 'unknown'. Normalize column names to snake_case and strip cell whitespace before processing. Provide O(N log N) or better joins and avoid quadratic loops. Use the following small ASCII samples to illustrate joins and expected aggregation behavior (you do not need to hardcode these): customers: +-------------+-------------+ | customer_id | name | +-------------+-------------+ | 1 | Ada Lovelace| | 2 | A. Turing | +-------------+-------------+ orders: +----------+-------------+------------+ | order_id | customer_id | shipper_id | +----------+-------------+------------+ | 10 | 1 | 100 | | 11 | 1 | null | | 12 | 2 | 101 | +----------+-------------+------------+ (OrderDate column may appear as 'OrderDate' or 'order_date' in files; assume values: 2025-05-01 for 10, 2025-05-03 for 11, 2025-05-04 for 12.) order_items: +----------+------------+----------+------------+ | order_id | product_id | quantity | unit_price | +----------+------------+----------+------------+ | 10 | 501 | 2 | 30.0 | | 10 | 502 | null | 10.0 | | 11 | 501 | 1 | null | | 12 | 503 | 3 | 7.5 | +----------+------------+----------+------------+ products: +------------+--------------+-------------+ | product_id | product_name | category_id | +------------+--------------+-------------+ | 501 | Widget A | 9001 | | 502 | Gadget B | 9002 | | 503 | Gizmo C | 9001 | +------------+--------------+-------------+ categories: +-------------+---------------+ | category_id | category_name | +-------------+---------------+ | 9001 | Tools | | 9002 | Accessories | +-------------+---------------+ shippers: +------------+--------------+ | shipper_id | shipper_name | +------------+--------------+ | 100 | FastShip | | 101 | SureShip | +------------+--------------+ payments: +----------+----------------+--------+ | order_id | payment_status | amount | +----------+----------------+--------+ | 10 | paid | 70.0 | | 11 | null | null | +----------+----------------+--------+ Sub-questions: 1) Specify the exact pandas operations (merges/groupbys) and any indices you would set to make it efficient. 2) Show the final expected row for order_id=10 (verify total_amount, product_count, category_list). 3) Explain how your imputation prevents data leakage if the data later gets split by date for modeling.

PracHub · Accepted Answer

This question evaluates data-wrangling and ETL competencies including normalization of inconsistent schemas, performant joins, aggregation, and hierarchical imputation strategies in pandas, as well as handling of missing values and final schema validation.

Quick Overview

Quick Overview