Using pandas, write a robust function unify_orders(...) that ingests seven dataframes (or CSVs) with possibly inconsistent column casing/whitespace and returns a single denormalized OrdersAnalytics table with exactly these columns, in this order: [order_id, order_date, customer_id, customer_name, shipper_name, total_amount, product_count, category_list, payment_status].

Rules:
- order_date must be a YYYY-MM-DD string;
- total_amount is sum(quantity * unit_price) across items per order;
- product_count is the count of distinct product_id per order;
- category_list is the ';'-joined, deduplicated, alphabetically sorted category_name values per order;
- Keep orders even if the shipper or payment is missing (shipper_name may be null; payment_status becomes 'unknown');
- No extra or missing columns; assert list(returned_df.columns) == [...].

Handle missing values:
- unit_price is imputed by the product-level median; if unavailable, use the category-level median; if still missing, use the global median across order_items;
- missing quantity -> impute 1;
- missing payment amount -> recompute from items; missing payment_status -> 'unknown'.

Normalize column names to snake_case and strip cell whitespace before processing. Provide O(N log N) or better joins and avoid quadratic loops.

Use the following small ASCII samples to illustrate joins and expected aggregation behavior (you do not need to hardcode these):

customers:
+-------------+--------------+
| customer_id | name         |
+-------------+--------------+
| 1           | Ada Lovelace |
| 2           | A. Turing    |
+-------------+--------------+

orders:
+----------+-------------+------------+
| order_id | customer_id | shipper_id |
+----------+-------------+------------+
| 10       | 1           | 100        |
| 11       | 1           | null       |
| 12       | 2           | 101        |
+----------+-------------+------------+

(The order-date column may appear as 'OrderDate' or 'order_date' in files; assume values 2025-05-01 for order 10, 2025-05-03 for 11, and 2025-05-04 for 12.)

order_items:
+----------+------------+----------+------------+
| order_id | product_id | quantity | unit_price |
+----------+------------+----------+------------+
| 10       | 501        | 2        | 30.0       |
| 10       | 502        | null     | 10.0       |
| 11       | 501        | 1        | null       |
| 12       | 503        | 3        | 7.5        |
+----------+------------+----------+------------+

products:
+------------+--------------+-------------+
| product_id | product_name | category_id |
+------------+--------------+-------------+
| 501        | Widget A     | 9001        |
| 502        | Gadget B     | 9002        |
| 503        | Gizmo C      | 9001        |
+------------+--------------+-------------+

categories:
+-------------+---------------+
| category_id | category_name |
+-------------+---------------+
| 9001        | Tools         |
| 9002        | Accessories   |
+-------------+---------------+

shippers:
+------------+--------------+
| shipper_id | shipper_name |
+------------+--------------+
| 100        | FastShip     |
| 101        | SureShip     |
+------------+--------------+

payments:
+----------+----------------+--------+
| order_id | payment_status | amount |
+----------+----------------+--------+
| 10       | paid           | 70.0   |
| 11       | null           | null   |
+----------+----------------+--------+

Sub-questions:
1) Specify the exact pandas operations (merges/groupbys) and any indices you would set to make the pipeline efficient.
2) Show the final expected row for order_id=10 (verify total_amount, product_count, and category_list).
3) Explain how your imputation prevents data leakage if the data is later split by date for modeling.
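For illustration only, here is a minimal sketch of the column normalization and the unit_price imputation cascade described in the rules above. It assumes the inputs are already loaded as pandas DataFrames; the helper names (normalize, impute_items) are placeholders, not part of the required unify_orders API, and this is not an answer to the sub-questions:

    import pandas as pd

    def normalize(df):
        # Snake_case the headers (e.g. 'OrderDate' -> 'order_date') and strip stray whitespace.
        df = df.copy()
        df.columns = (
            df.columns.str.strip()
                      .str.replace(r"(?<=[a-z0-9])(?=[A-Z])", "_", regex=True)
                      .str.lower()
        )
        # Strip whitespace from string cells, leaving non-strings (numbers, NaN) untouched.
        for col in df.select_dtypes(include="object"):
            df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)
        return df

    def impute_items(order_items, products):
        # unit_price cascade: product-level median -> category-level median -> global median.
        items = order_items.merge(
            products[["product_id", "category_id"]], on="product_id", how="left"
        )
        by_product = items.groupby("product_id")["unit_price"].transform("median")
        by_category = items.groupby("category_id")["unit_price"].transform("median")
        items["unit_price"] = (
            items["unit_price"]
            .fillna(by_product)
            .fillna(by_category)
            .fillna(items["unit_price"].median())
        )
        # Missing quantity defaults to 1, as required by the rules.
        items["quantity"] = items["quantity"].fillna(1)
        return items

Your answer may structure this differently; the sketch only fixes the intended semantics of "normalize column names to snake_case, strip cell whitespace" and the three-level unit_price fallback.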