Design features for house price prediction
Company: Two Sigma
Role: Data Scientist
Category: Machine Learning
Interview Round: Technical Screen
## Scenario
You are building a model to predict **house sale price** from a tabular dataset similar to typical real-estate datasets. The interviewer expects a simple baseline model (e.g., linear regression) but mainly wants to understand your reasoning.
## Questions
1. **Which features are likely to be predictive** of house price, and why? (Examples: location, size, age, condition, amenities, nearby schools, etc.)
2. **How do you decide which features are usable** (available at prediction time, not leaking label information, stable definitions)?
3. What **data cleaning** steps would you perform before modeling?
4. If starting with **linear regression**, how would you:
   - handle missing values,
   - handle categorical variables,
   - reduce the impact of outliers/skewed price distributions,
   - detect multicollinearity and mitigate it?
5. How would you evaluate the model and iterate on improvements?
Assume you have a training set with historical sales and a holdout set for evaluation.
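
To make the linear-regression steps in question 4 concrete, here is a minimal NumPy-only sketch of one possible approach: median imputation with a missingness indicator, one-hot encoding against a fixed category list, a log-transformed target, and a variance inflation factor (VIF) check. The helper names (`impute_with_indicator`, `one_hot`, `vif`), the column names, and all data values are hypothetical and synthetic; they are not part of the original question.

```python
import numpy as np

def impute_with_indicator(col, fill_value):
    """Replace NaNs with a value learned on the training set and return
    a 0/1 indicator so the model can also use the missingness itself."""
    col = np.asarray(col, dtype=float)
    missing = np.isnan(col)
    return np.where(missing, fill_value, col), missing.astype(float)

def one_hot(values, categories):
    """One-hot encode against a fixed category list from the training set.
    The first category is dropped as the baseline (avoids perfect
    collinearity with the intercept); unseen categories become all zeros."""
    values = np.asarray(values)
    return np.column_stack([(values == c).astype(float) for c in categories[1:]])

def vif(X):
    """Variance inflation factor of each column of X (X has no intercept column).
    Each column is regressed on the others; VIF = 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / max(1.0 - r2, 1e-12))
    return np.array(out)

# Synthetic toy rows (hypothetical columns: sqft, neighborhood, price).
sqft = np.array([1400.0, np.nan, 2100.0, 1750.0, 3000.0, 1200.0])
hood = np.array(["A", "B", "A", "C", "B", "C"])
price = np.array([310e3, 405e3, 480e3, 350e3, 900e3, 240e3])

# Missing values: the fill value is computed on training data only.
sqft_filled, sqft_missing = impute_with_indicator(sqft, np.nanmedian(sqft))

# Categorical variables: dummy columns relative to baseline "A".
hood_dummies = one_hot(hood, categories=["A", "B", "C"])

# Design matrix: intercept, log(sqft), missing flag, neighborhood dummies.
X = np.column_stack([np.ones(len(price)), np.log(sqft_filled), sqft_missing, hood_dummies])

# Skewed prices: fit on log(price), then map predictions back with exp.
beta, *_ = np.linalg.lstsq(X, np.log(price), rcond=None)
pred_price = np.exp(X @ beta)
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity; typical mitigations are dropping or combining the offending features, or switching to ridge regression. Note that exponentiating a log-scale prediction tends to underestimate the mean price, so a correction may be needed if the mean is the target.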
Quick Answer: This question evaluates competence in feature engineering, data preprocessing, baseline regression modeling, and model evaluation for tabular price prediction problems.
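
For the evaluation step, one reasonable starting point is to score holdout predictions with several error metrics and compare them against a naive baseline, such as always predicting the training-set median price. The sketch below assumes NumPy only; `regression_report` is a hypothetical helper, and the prices and predictions are synthetic placeholders rather than real results.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return common regression error metrics for positive-valued targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    log_err = np.log1p(y_pred) - np.log1p(y_true)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),        # penalizes large misses
        "mae": float(np.mean(np.abs(err))),                # typical miss in dollars
        "mape": float(np.mean(np.abs(err) / y_true)),      # relative error
        "rmsle": float(np.sqrt(np.mean(log_err ** 2))),    # error on the log scale
    }

# Synthetic training and holdout prices (illustrative values only).
y_train = np.array([250e3, 320e3, 410e3, 700e3])
y_holdout = np.array([200e3, 300e3, 500e3, 1.0e6])

# Hypothetical model predictions for the holdout rows.
model_pred = np.array([210e3, 280e3, 520e3, 900e3])

# Naive baseline: always predict the training median.
baseline_pred = np.full_like(y_holdout, np.median(y_train))

model_scores = regression_report(y_holdout, model_pred)
baseline_scores = regression_report(y_holdout, baseline_pred)

for name in model_scores:
    print(f"{name}: model={model_scores[name]:.4g} baseline={baseline_scores[name]:.4g}")
```

Reporting both absolute (RMSE, MAE) and relative (MAPE, RMSLE) metrics helps because a fixed dollar error matters more on a cheap house than on an expensive one. Iteration can then focus on where residuals are largest, for example by segmenting errors by neighborhood or price band.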