Model Selection for Binary Classification with Limited Data and Potential Non-Linearities
Scenario
You are designing a binary classifier with limited labeled data. The signal may be partly non-linear, and you care about generalization and interpretability.
Questions
- What is logistic regression, and what is its loss function? Briefly note its optimization properties (convexity).
- When can logistic regression outperform a Random Forest?
- Explain L1 and L2 regularization and their effects (e.g., sparsity, multicollinearity).
- How would you detect and mitigate overfitting in logistic regression?
- Compare Random Forest and Boosting (e.g., Gradient Boosting) in terms of bias, variance, interpretability, and typical use cases. Include thoughts on ensemble diversity and probability calibration.
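
As a possible starting point for the questions on the loss function and on regularization, here is a minimal sketch in standard-library Python of the penalized logistic loss and a plain gradient-descent fit. Everything named here is an illustrative assumption rather than part of the question: the toy data (two nearly collinear features), the function names `sigmoid`, `penalized_log_loss`, and `fit_gd`, the penalty scaling (`l1 * sum|w|` and `0.5 * l2 * sum w^2`, bias unpenalized), and the learning rate and step count. Only the L2 penalty is used in the fit; an L1 fit would need a subgradient or proximal step, which is omitted.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def penalized_log_loss(w, b, X, y, l1=0.0, l2=0.0):
    # Mean negative log-likelihood plus optional L1/L2 penalties on w (not on b).
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    penalty = l1 * sum(abs(wj) for wj in w) + 0.5 * l2 * sum(wj * wj for wj in w)
    return total / len(X) + penalty

def fit_gd(X, y, l2=0.0, lr=0.1, steps=2000):
    # Batch gradient descent on the L2-penalized loss, starting from zero.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * (gw[j] / n + l2 * wj) for j, wj in enumerate(w)]
        b -= lr * gb / n
    return w, b

# Hypothetical toy data: the second feature is a noisy copy of the first,
# and the classes are linearly separable.
X = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2], [5.0, 4.8]]
y = [0, 0, 0, 1, 1, 1]

w_plain, b_plain = fit_gd(X, y, l2=0.0)
w_ridge, b_ridge = fit_gd(X, y, l2=1.0)

print("unpenalized weights:", w_plain)
print("L2-penalized weights:", w_ridge)
```

On separable data like this, the unpenalized weights keep growing as training continues, while the L2 penalty keeps them bounded and spreads weight across the correlated features. Comparing the two fits on held-out data is one way to start on the overfitting question.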