Clean and Merge Housing Data
Company: Capital One
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: easy
Interview Round: Onsite
You are given two pandas DataFrames from a house-price screening exercise.
`house_sales_raw`
- `listing_id` INT
- `sale_price` FLOAT, nullable in rows reserved for scoring
- `sqft_living` STRING
- `lot_size` STRING
- `bedrooms` STRING
- `bathrooms` STRING
- `year_built` STRING
- `zipcode` STRING
- `house_type` STRING
- `garage_spaces` STRING, nullable
- `condition_score` STRING, nullable
`zipcode_features`
- `zipcode` STRING
- `median_income` FLOAT
- `school_score` FLOAT
Relationship:
- `house_sales_raw.zipcode = zipcode_features.zipcode`
Tasks:
1. Convert numeric-like string columns to the correct pandas numeric dtype. Invalid parses should become null rather than causing failures.
2. Merge `house_sales_raw` with `zipcode_features` on `zipcode`.
3. Fill missing values using simple imputation rules: median for numeric columns, mode for categorical columns, and create explicit indicator columns for imputed fields where appropriate.
4. Return a cleaned DataFrame with one row per `listing_id` and the following output columns:
`listing_id, sale_price, sqft_living, lot_size, bedrooms, bathrooms, year_built, zipcode, house_type, garage_spaces, condition_score, median_income, school_score`.
You may assume the exercise is purely tabular and no timezone handling is needed.
Quick Answer: This question evaluates data cleaning, type coercion, robust merge operations, and missing-data imputation skills within tabular datasets, emphasizing pandas-based DataFrame manipulation and integration of auxiliary zipcode features.