Clean and Merge Housing Data

Q: Clean and Merge Housing Data

This question evaluates data cleaning, type coercion, robust merge operations, and missing-data imputation skills within tabular datasets, emphasizing pandas-based DataFrame manipulation and integration of auxiliary zipcode features.

Q: How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

Q: What difficulty level is this interview question?

This is a easy difficulty Data Manipulation (SQL/Python) question, commonly asked during Onsite rounds at Capital One.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Capital One during technical interviews.

Question

Loading...

You are given two pandas DataFrames from a house-price screening exercise.

house_sales_raw

listing_id INT
sale_price FLOAT, nullable in rows reserved for scoring
sqft_living STRING
lot_size STRING
bedrooms STRING
bathrooms STRING
year_built STRING
zipcode STRING
house_type STRING
garage_spaces STRING, nullable
condition_score STRING, nullable

zipcode_features

zipcode STRING
median_income FLOAT
school_score FLOAT

Relationship:

house_sales_raw.zipcode = zipcode_features.zipcode

Tasks:

Convert numeric-like string columns to the correct pandas numeric dtype. Invalid parses should become null rather than causing failures.
Merge house_sales_raw with zipcode_features on zipcode .
Fill missing values using simple imputation rules: median for numeric columns, mode for categorical columns, and create explicit indicator columns for imputed fields where appropriate.
Return a cleaned DataFrame with one row per listing_id and the following output columns: listing_id, sale_price, sqft_living, lot_size, bedrooms, bathrooms, year_built, zipcode, house_type, garage_spaces, condition_score, median_income, school_score .

You may assume the exercise is purely tabular and no timezone handling is needed.

Clean and Merge Housing Data

Quick Overview

Comments (0)