End-to-End System Design: Predict Residential Property Sale Prices
Context
You are tasked with building a production-grade machine learning system to predict the sale price of residential properties in a large city. You have ~10 years of historical, geocoded sales with property attributes, plus external data (transit GTFS, schools, zoning, crime, permits, points of interest, macroeconomic indicators). The system must generalize across time and neighborhoods, avoid leakage, and be explainable and compliant.
Requirements
- Features and Engineering
  - Enumerate key feature groups: geospatial, transit accessibility, school quality, neighborhood effects, time-of-sale, macro factors, property attributes, and environmental factors.
  - Specify feature engineering: e.g., distance/travel-time to POIs, spatial lags and neighborhood aggregates, encodings for high-cardinality categoricals.
  - Explain how you will handle high-cardinality categoricals (e.g., neighborhood, school, zip) without leakage.
- Training and Validation Strategy
  - Propose a time-aware and spatially blocked cross-validation that avoids leakage and overly optimistic estimates.
  - State and justify evaluation metrics (e.g., RMSLE vs. RMSE vs. MAPE) and calibration checks.
- Model Choices and Interpretability
  - Compare GBMs vs. Random Forests vs. linear models with interactions/GAMs, and propose a final approach.
  - Provide an interpretability plan (global and local), including how to communicate drivers of price.
- Fairness and Compliance
  - Identify potential proxies for protected classes (e.g., redlining risks) and how you will mitigate and test for them.
  - Outline documentation and reviews needed for compliance.
- Deployment, Monitoring, Ablations, and Error Analysis
  - Describe deployment architecture (batch vs. online), feature store, retraining cadence, and CI/CD.
  - List monitoring KPIs (performance, drift, calibration) and segment-based alerts.
  - Explain ablation studies and error analysis you would run to improve the model and build trust.
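
To make the Training and Validation Strategy requirements above more concrete, here is a minimal sketch of one possible reading, not a reference solution. It assumes each sale is a dict with hypothetical `lat`, `lon`, and `sold` keys, uses a fixed latitude/longitude grid as the spatial block (an illustrative choice; neighborhoods or clusters would work as well), and includes plain definitions of RMSE, RMSLE, and MAPE for strictly positive prices. The function names, grid size, and toy records are all made up for illustration.

```python
import math
from datetime import date


def grid_block(lat, lon, cell_deg=0.02):
    """Map a coordinate to a coarse grid cell that serves as the spatial block id."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def time_spatial_folds(sales, cutoffs, n_groups=2, cell_deg=0.02):
    """Yield (train_idx, test_idx) index lists.

    For each time window starting at a cutoff and each held-out block group:
      train = sales strictly before the cutoff, located outside the held-out group
      test  = sales inside the window, located inside the held-out group
    """
    blocks = [grid_block(s["lat"], s["lon"], cell_deg) for s in sales]
    # Deterministic round-robin assignment of grid cells to groups (an assumption).
    group_of = {b: i % n_groups for i, b in enumerate(sorted(set(blocks)))}
    bounds = list(cutoffs) + [date.max]
    for start, end in zip(bounds[:-1], bounds[1:]):
        for g in range(n_groups):
            train = [i for i, s in enumerate(sales)
                     if s["sold"] < start and group_of[blocks[i]] != g]
            test = [i for i, s in enumerate(sales)
                    if start <= s["sold"] < end and group_of[blocks[i]] == g]
            if train and test:  # skip folds with an empty side
                yield train, test


def rmse(y_true, y_pred):
    """Root mean squared error, in currency units."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmsle(y_true, y_pred):
    """Root mean squared error of log1p-transformed values (relative error)."""
    return math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred))
        / len(y_true)
    )


def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes every true price is positive."""
    return sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy records (synthetic, for illustration only): four grid cells, three years.
sales = [
    {"lat": 40.705, "lon": -74.005, "sold": date(2019, 6, 1)},
    {"lat": 40.705, "lon": -73.975, "sold": date(2019, 9, 1)},
    {"lat": 40.735, "lon": -74.005, "sold": date(2020, 3, 1)},
    {"lat": 40.735, "lon": -73.975, "sold": date(2020, 8, 1)},
    {"lat": 40.705, "lon": -74.005, "sold": date(2021, 2, 1)},
    {"lat": 40.705, "lon": -73.975, "sold": date(2021, 5, 1)},
    {"lat": 40.735, "lon": -74.005, "sold": date(2019, 3, 1)},
]
folds = list(time_spatial_folds(sales, [date(2020, 1, 1), date(2021, 1, 1)]))

# A 10% over-prediction on a cheaper and on a more expensive home.
y_true = [200_000, 2_000_000]
y_pred = [220_000, 2_200_000]
```

In this toy example, every fold trains only on sales that closed before the test window opens and only on grid cells outside the held-out group, so neither future prices nor same-cell neighbors reach the training side. Adjacent cells in different groups can still be physically close, so a buffer zone around held-out cells may be worth adding. On the metric side, both predictions miss by 10%: MAPE and RMSLE treat the two homes alike, while RMSE is dominated by the more expensive one, which is the kind of trade-off the metric justification is expected to address.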