Handling Missing Values for LGD Modeling
Context
You are building a Loss Given Default (LGD) model using account- and borrower-level features captured around the time of default. The dataset contains both continuous and categorical variables with non-trivial missingness due to reporting gaps, system migrations, and process differences across products/regions.
Task
Describe how you would handle missing values in this LGD modeling context. Specifically, compare the following approaches:
- Multiple imputation (e.g., MICE)
- Model-based imputation (e.g., kNN, random forest, regression)
- Business-rule fills (domain-driven heuristics)
- Indicator variables (missingness flags; Unknown category)
- Leaving missingness explicit (letting the model handle NA directly)
For each, explain:
- Assumptions about the missingness mechanism (MCAR, MAR, MNAR)
- Pros, cons, and typical use cases in LGD modeling
- Guardrails to avoid bias and leakage
- How you would validate the choice and measure impact on LGD performance and stability
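As a concrete reference point for the comparison, here is a minimal sketch of the indicator-variable approach combined with one leakage guardrail: fill values are learned from the training split only and then reused on held-out data. It assumes pandas and NumPy are available. The column names (`ltv_at_default`, `collateral_type`), the helper functions, and the toy data are hypothetical and purely illustrative. The median fill is a placeholder choice, not a recommended default.

```python
import numpy as np
import pandas as pd


def fit_fill_values(train: pd.DataFrame, num_cols: list[str]) -> pd.Series:
    """Learn numeric fill values from the training split only (leakage guardrail)."""
    return train[num_cols].median()


def apply_missing_handling(
    df: pd.DataFrame,
    num_cols: list[str],
    cat_cols: list[str],
    medians: pd.Series,
) -> pd.DataFrame:
    """Add missingness flags, fill numerics with train medians, map NA categories to 'Unknown'."""
    out = df.copy()
    for col in num_cols:
        # The flag preserves the information that the value was missing,
        # which matters if missingness itself is related to LGD (possible MNAR).
        out[f"{col}_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(medians[col])
    for col in cat_cols:
        # An explicit 'Unknown' level keeps missing categories visible to the model.
        out[col] = out[col].fillna("Unknown")
    return out


# Hypothetical toy data; real LGD features would come from the default snapshot.
train = pd.DataFrame({
    "ltv_at_default": [0.8, np.nan, 1.1, 0.6],
    "collateral_type": ["residential", None, "commercial", "residential"],
})
test = pd.DataFrame({
    "ltv_at_default": [np.nan, 0.9],
    "collateral_type": [None, "commercial"],
})

num_cols, cat_cols = ["ltv_at_default"], ["collateral_type"]
medians = fit_fill_values(train, num_cols)
train_out = apply_missing_handling(train, num_cols, cat_cols, medians)
test_out = apply_missing_handling(test, num_cols, cat_cols, medians)
print(test_out)
```

The same fit-on-train, apply-to-holdout pattern carries over to the other imputation approaches in the list. Only the fitted object changes, for example an imputation model instead of a vector of medians.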