Scenario
You are building a model to predict house sale price from a tabular dataset (similar to typical real-estate datasets). The interviewer expects a simple baseline model (e.g., linear regression), but wants to understand your reasoning.
Questions
- Which features are likely to be predictive of house price, and why? (Examples: location, size, age, condition, amenities, nearby schools, etc.)
- How do you decide which features are usable (available at prediction time, not leaking label information, stable definitions)?
- What data cleaning steps would you perform before modeling?
- If starting with linear regression, how would you:
  - handle missing values,
  - handle categorical variables,
  - reduce the impact of outliers/skewed price distributions,
  - detect multicollinearity and mitigate it?
- How would you evaluate the model and iterate on improvements?
Assume you have a training set with historical sales and a holdout set for evaluation.
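
As one possible starting point for the linear-regression sub-questions, the sketch below shows common choices: median imputation with a missing-value indicator, one-hot encoding with a reference level, a log-transformed target to dampen skew, a pairwise correlation check as a first look at multicollinearity, and an RMSE computed on the log scale for the holdout set. It is a minimal sketch, not a reference answer. The column names (`sqft`, `neighborhood`, `price`), the toy rows, and every helper function are hypothetical and chosen only for illustration.

```python
import math
from statistics import median

# Hypothetical toy rows; the column names are illustrative assumptions.
train_rows = [
    {"sqft": 1400.0, "neighborhood": "A", "price": 250_000.0},
    {"sqft": None, "neighborhood": "B", "price": 310_000.0},
    {"sqft": 2100.0, "neighborhood": "A", "price": 420_000.0},
    {"sqft": 1750.0, "neighborhood": None, "price": 1_900_000.0},
]


def fit_preprocessor(rows):
    """Learn the imputation value and category levels from training rows only."""
    sqft_median = median(r["sqft"] for r in rows if r["sqft"] is not None)
    # Treat a missing category as its own level instead of dropping the row.
    levels = sorted({r["neighborhood"] or "MISSING" for r in rows})
    return {"sqft_median": sqft_median, "levels": levels}


def transform(row, prep):
    """Turn one row into a numeric feature vector using the fitted values."""
    sqft_missing = row["sqft"] is None
    sqft = prep["sqft_median"] if sqft_missing else row["sqft"]
    level = row["neighborhood"] or "MISSING"
    # The first level is the reference category and gets no column, which avoids
    # perfect collinearity with an intercept. Unseen levels also map to all zeros.
    one_hot = [1.0 if level == lvl else 0.0 for lvl in prep["levels"][1:]]
    return [sqft, 1.0 if sqft_missing else 0.0] + one_hot


def log_target(price):
    """Compress the right tail of the price distribution before fitting."""
    return math.log1p(price)


def rmse_log(y_true, y_pred):
    """RMSE between log1p-transformed actual and predicted prices."""
    errors = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(errors) / len(errors))


def pearson(xs, ys):
    """Pairwise correlation; values near +1 or -1 hint at redundant features."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


prep = fit_preprocessor(train_rows)
features = [transform(r, prep) for r in train_rows]
targets = [log_target(r["price"]) for r in train_rows]
```

Fitting the preprocessor on the training rows and reusing it unchanged on the holdout set keeps holdout information out of the imputation value and category levels. Pairwise correlation only catches two-feature redundancy; variance inflation factors or a regularized model such as ridge regression are common next steps when several features overlap.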