Design a house-price prediction model
Company: Two Sigma
Role: Data Scientist
Category: Machine Learning
Difficulty: easy
Interview Round: Technical Screen
## Problem
You are asked to build a model to **predict house sale prices** for a city of your choice.
### Data (assume typical real-estate fields)
You have a historical dataset of home listings and sales with fields such as:
- `sale_id` (string, unique)
- `city` (string)
- `sale_date` (date)
- `sale_price` (float, target)
- `bedrooms` (int), `bathrooms` (float), `sqft` (float), `lot_sqft` (float)
- `year_built` (int)
- `zipcode`/`neighborhood` (string)
- `lat`/`lon` (float)
- `property_type` (categorical)
- `days_on_market` (int)
- Optional external features (if you choose): school ratings, crime, interest rates, nearby transit, etc.
### Tasks
1. Propose an end-to-end approach (data cleaning → feature engineering → model selection).
2. Define how you would split data (train/validation/test) and **avoid leakage**.
3. Choose evaluation metrics and justify them (e.g., MAE vs RMSE vs MAPE).
4. Explain how you would handle:
   - missing values and outliers
   - high-cardinality location features (zip/neighborhood)
   - temporal drift (market changes)
5. Describe how you would interpret the model and communicate results to stakeholders.
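
As a starting point for tasks 2 and 3, the sketch below shows one possible way to split by `sale_date` and to compute MAE, RMSE, and MAPE side by side. It is a minimal illustration, not a reference solution: the function names, the cutoff dates, and the toy rows are all assumptions made for this example, and a real answer would likely use a library such as scikit-learn instead of hand-written metrics.

```python
from datetime import date
from math import sqrt


def chronological_split(rows, train_end, valid_end):
    """Split rows by sale_date so that later sales never inform earlier ones."""
    train = [r for r in rows if r["sale_date"] <= train_end]
    valid = [r for r in rows if train_end < r["sale_date"] <= valid_end]
    holdout = [r for r in rows if r["sale_date"] > valid_end]
    return train, valid, holdout


def mae(y_true, y_pred):
    """Mean absolute error: average dollar miss, robust to a few large errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily than MAE."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error: relative miss; assumes all prices > 0."""
    return sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy rows, only to show the call pattern.
sales = [
    {"sale_date": date(2022, 3, 1), "sale_price": 300_000.0},
    {"sale_date": date(2022, 9, 15), "sale_price": 320_000.0},
    {"sale_date": date(2023, 2, 10), "sale_price": 350_000.0},
    {"sale_date": date(2023, 8, 5), "sale_price": 340_000.0},
]
train, valid, holdout = chronological_split(
    sales, train_end=date(2022, 12, 31), valid_end=date(2023, 6, 30)
)

# Hypothetical predictions for three homes at different price levels.
y_true = [100_000.0, 200_000.0, 400_000.0]
y_pred = [110_000.0, 180_000.0, 500_000.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

On the toy predictions, the single $100,000 miss on the most expensive home pulls RMSE well above MAE, while MAPE treats it as a 25% error. That difference is the kind of trade-off task 3 asks you to justify.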
Quick Answer: This question tests end-to-end regression modeling for house-price prediction: feature engineering, data splitting and leakage prevention, metric selection, handling of missing values, outliers, and high-cardinality location features, temporal drift management, and model interpretation. It is commonly asked in the Machine Learning domain and weighs practical application, supported by conceptual reasoning about validation and metric trade-offs, along with stakeholder communication.
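
One common way to handle a high-cardinality field such as `zipcode` without leaking the target is smoothed target encoding fitted on the training split only. The sketch below is one possible illustration under stated assumptions: the function names, the `smoothing` parameter, and the toy rows are hypothetical, and the question does not prescribe this technique.

```python
from collections import defaultdict


def fit_target_encoding(train_rows, key="zipcode", target="sale_price", smoothing=10.0):
    """Learn a smoothed mean price per category from training rows only.

    Categories with few sales are shrunk toward the global training mean,
    which limits overfitting on rare zip codes.
    """
    global_mean = sum(r[target] for r in train_rows) / len(train_rows)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in train_rows:
        sums[r[key]] += r[target]
        counts[r[key]] += 1
    encoding = {
        k: (sums[k] + smoothing * global_mean) / (counts[k] + smoothing)
        for k in counts
    }
    return encoding, global_mean


def apply_target_encoding(rows, encoding, global_mean, key="zipcode"):
    """Map each row's category to its learned value; unseen categories get the global mean."""
    return [encoding.get(r[key], global_mean) for r in rows]


# Hypothetical toy rows, only to show the call pattern.
train_rows = [
    {"zipcode": "A", "sale_price": 100.0},
    {"zipcode": "A", "sale_price": 300.0},
    {"zipcode": "B", "sale_price": 500.0},
]
new_rows = [{"zipcode": "A"}, {"zipcode": "Z"}]  # "Z" never appears in training

encoding, global_mean = fit_target_encoding(train_rows, smoothing=2.0)
encoded = apply_target_encoding(new_rows, encoding, global_mean)
```

Applying the fitted encoding back to the same training rows still lets each row see its own price, so an out-of-fold variant is usually preferred for the training features themselves. Validation and test rows should only ever be encoded with statistics learned from earlier training data.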