PracHub
QuestionsPremiumLearningGuidesCheatsheetNEWCoaches

Quick Overview

This question evaluates data-wrangling and ETL competencies including normalization of inconsistent schemas, performant joins, aggregation, and hierarchical imputation strategies in pandas, as well as handling of missing values and final schema validation.

  • Medium
  • Boston Consulting Group
  • Data Manipulation (SQL/Python)
  • Data Scientist

Unify 7 tables and impute missing values

Company: Boston Consulting Group

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Take-home Project

Using pandas, write a robust function unify_orders(...) that ingests seven dataframes (or CSVs) with possibly inconsistent column casing/whitespace and returns a single denormalized OrdersAnalytics table with exact columns and order: [order_id, order_date, customer_id, customer_name, shipper_name, total_amount, product_count, category_list, payment_status]. Rules: - order_date must be a YYYY-MM-DD string; - total_amount is sum(quantity*unit_price) across items per order; - product_count is count of distinct product_id per order; - category_list is ';'-joined, deduplicated, alphabetically sorted category_name per order; - Keep orders even if shipper or payment is missing (shipper_name may be null; payment_status becomes 'unknown'); - No extra/missing columns; assert returned_df.columns == [...]. Handle missing values: unit_price imputed by product-level median; if unavailable, use category-level median; if still missing, use global median across order_items. quantity missing -> impute 1. Payment amount missing -> recompute from items; payment_status missing -> 'unknown'. Normalize column names to snake_case and strip cell whitespace before processing. Provide O(N log N) or better joins and avoid quadratic loops. Use the following small ASCII samples to illustrate joins and expected aggregation behavior (you do not need to hardcode these): customers: +-------------+-------------+ | customer_id | name | +-------------+-------------+ | 1 | Ada Lovelace| | 2 | A. Turing | +-------------+-------------+ orders: +----------+-------------+------------+ | order_id | customer_id | shipper_id | +----------+-------------+------------+ | 10 | 1 | 100 | | 11 | 1 | null | | 12 | 2 | 101 | +----------+-------------+------------+ (OrderDate column may appear as 'OrderDate' or 'order_date' in files; assume values: 2025-05-01 for 10, 2025-05-03 for 11, 2025-05-04 for 12.) order_items: +----------+------------+----------+------------+ | order_id | product_id | quantity | unit_price | +----------+------------+----------+------------+ | 10 | 501 | 2 | 30.0 | | 10 | 502 | null | 10.0 | | 11 | 501 | 1 | null | | 12 | 503 | 3 | 7.5 | +----------+------------+----------+------------+ products: +------------+--------------+-------------+ | product_id | product_name | category_id | +------------+--------------+-------------+ | 501 | Widget A | 9001 | | 502 | Gadget B | 9002 | | 503 | Gizmo C | 9001 | +------------+--------------+-------------+ categories: +-------------+---------------+ | category_id | category_name | +-------------+---------------+ | 9001 | Tools | | 9002 | Accessories | +-------------+---------------+ shippers: +------------+--------------+ | shipper_id | shipper_name | +------------+--------------+ | 100 | FastShip | | 101 | SureShip | +------------+--------------+ payments: +----------+----------------+--------+ | order_id | payment_status | amount | +----------+----------------+--------+ | 10 | paid | 70.0 | | 11 | null | null | +----------+----------------+--------+ Sub-questions: 1) Specify the exact pandas operations (merges/groupbys) and any indices you would set to make it efficient. 2) Show the final expected row for order_id=10 (verify total_amount, product_count, category_list). 3) Explain how your imputation prevents data leakage if the data later gets split by date for modeling.

Quick Answer: This question evaluates data-wrangling and ETL competencies including normalization of inconsistent schemas, performant joins, aggregation, and hierarchical imputation strategies in pandas, as well as handling of missing values and final schema validation.

Last updated: Mar 29, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Transform messy transactions with pandas - Boston Consulting Group (Medium)
  • Query top spenders and 7-day growth - Boston Consulting Group (Medium)
  • Manipulate and merge DataFrames correctly - Boston Consulting Group (Medium)
  • Transform and aggregate messy event data - Boston Consulting Group (Medium)
  • Merge and Concatenate Inconsistent Order Files with Pandas - Boston Consulting Group (Medium)