Imbalanced Classification & Regression: ROC/PR, Losses, and Training Strategies
You are evaluating a binary classifier and a regression head in a machine learning take-home. Answer all parts concisely but show your steps where calculations are requested.
A) ROC Curve and AUC from Toy Scores
Given scores for 5 positives and 5 negatives, sweep the decision threshold from +∞ down to −∞. At equal scores (if any), break ties by ranking positives above negatives.
- Positives: P1 = 0.99, P2 = 0.80, P3 = 0.60, P4 = 0.40, P5 = 0.10
- Negatives: N1 = 0.95, N2 = 0.70, N3 = 0.55, N4 = 0.30, N5 = 0.05
Tasks:
- List the ROC points (FPR, TPR) encountered as you lower the threshold.
- Compute ROC-AUC using the trapezoidal rule; show the segment-by-segment calculation.
- Interpret the AUC as the probability a random positive ranks above a random negative.
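A hand calculation for part A can be cross-checked with a short script. The following is a minimal Python sketch, assuming the ten scores listed above and the stated tie-break (positives first). The helper names `roc_points`, `trapezoid_auc`, and `pairwise_auc` are illustrative and not part of the prompt.

```python
# Toy scores from part A.
positives = [0.99, 0.80, 0.60, 0.40, 0.10]
negatives = [0.95, 0.70, 0.55, 0.30, 0.05]


def roc_points(pos, neg):
    """Return (FPR, TPR) points as the threshold sweeps from +inf down to -inf."""
    # Sort by score descending; on equal scores, positives (label 1) come first.
    ranked = sorted(
        [(s, 1) for s in pos] + [(s, 0) for s in neg],
        key=lambda item: (-item[0], -item[1]),
    )
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / len(neg), tp / len(pos)))
    return points


def trapezoid_auc(points):
    """Sum the trapezoid areas between consecutive ROC points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def pairwise_auc(pos, neg):
    """Fraction of (positive, negative) pairs with the positive ranked first.

    Ties are counted as wins to match the positives-first tie-break above.
    """
    wins = sum(1 for p in pos for n in neg if p >= n)
    return wins / (len(pos) * len(neg))


points = roc_points(positives, negatives)
auc_trap = trapezoid_auc(points)
auc_pair = pairwise_auc(positives, negatives)

for fpr, tpr in points:
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
print(f"trapezoidal AUC = {auc_trap:.2f}, pairwise AUC = {auc_pair:.2f}")
```

On these scores the two estimates agree at 0.6, since 15 of the 25 positive-negative pairs are ranked correctly. The vertical ROC segments contribute zero area, so only the five horizontal steps matter in the trapezoidal sum.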
B) Metrics Under 1% Prevalence
With prevalence = 1% (positives are rare):
- Explain why overall accuracy can be misleading.
- Propose two better metrics for model selection.
- State when you would prefer ROC-AUC vs PR-AUC.
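The accuracy pitfall in part B is easy to see with concrete counts. The sketch below is illustrative only: the 10,000-example dataset and both confusion matrices are hypothetical numbers chosen to match 1% prevalence, and precision, recall, and F1 are one possible choice of alternative metrics rather than the expected answer.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: 10,000 examples at 1% prevalence (100 positives, 9,900 negatives).
# Baseline that always predicts "negative".
always_negative = binary_metrics(tp=0, fp=0, fn=100, tn=9900)

# Hypothetical model that finds 80 of the 100 positives but raises 400 false alarms.
useful_model = binary_metrics(tp=80, fp=400, fn=20, tn=9500)

for name, m in [("always negative", always_negative), ("useful model", useful_model)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

With these assumed counts, the always-negative baseline scores higher on accuracy than the model that actually finds positives, while its recall is zero. The same base-rate effect is why the chance-level reference differs between the two curves: roughly 0.5 for ROC-AUC and roughly the prevalence (0.01 here) for PR-AUC.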
C) MSE vs MAE as Regression Losses
For a regression head:
- Write each loss and derive the gradient with respect to the prediction.
- Explain robustness to outliers and optimization behavior.
- Give one practical scenario where each is preferable.
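For part C, the two losses and their gradients can be checked numerically. This is a minimal sketch, assuming mean-reduced losses over n examples: MSE = (1/n) Σ (ŷ − y)² with gradient 2(ŷ − y)/n, and MAE = (1/n) Σ |ŷ − y| with subgradient sign(ŷ − y)/n (taken as 0 where the residual is exactly zero). The targets and predictions are made-up values with one deliberate outlier.

```python
import statistics


def mse_and_grad(y_pred, y_true):
    """Mean squared error and its gradient with respect to each prediction."""
    n = len(y_true)
    residuals = [p - t for p, t in zip(y_pred, y_true)]
    loss = sum(r * r for r in residuals) / n
    grad = [2.0 * r / n for r in residuals]  # scales linearly with the residual
    return loss, grad


def mae_and_grad(y_pred, y_true):
    """Mean absolute error and a subgradient with respect to each prediction."""
    n = len(y_true)
    residuals = [p - t for p, t in zip(y_pred, y_true)]
    loss = sum(abs(r) for r in residuals) / n
    # sign(r) / n; the choice of 0 at r == 0 is one valid subgradient.
    grad = [((r > 0) - (r < 0)) / n for r in residuals]
    return loss, grad


# Made-up data: the last target (100.0) is an outlier.
y_true = [1.0, 2.0, 3.0, 100.0]
y_pred = [1.5, 1.5, 3.5, 3.0]

mse_loss, mse_grad = mse_and_grad(y_pred, y_true)
mae_loss, mae_grad = mae_and_grad(y_pred, y_true)

# Best constant prediction under each loss: the mean for MSE, the median for MAE.
best_const_mse = statistics.mean(y_true)
best_const_mae = statistics.median(y_true)

print("MSE grad:", mse_grad)
print("MAE grad:", mae_grad)
print("mean:", best_const_mse, "median:", best_const_mae)
```

On this data the outlier's MSE gradient is -48.5 while its MAE gradient is -0.25, the same magnitude as every other example. That contrast is the core of the robustness argument: MSE lets a single large residual dominate the update, whereas MAE gives constant-magnitude gradients that are robust but do not shrink near the optimum.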
D) Improving an Imbalanced Binary Classifier (Neural Network)
On the same 1% prevalence task, propose two concrete architecture/training changes (e.g., focal loss with typical γ and α; class weighting; down-sampling negatives or up-sampling positives; a decision-threshold strategy). For each, discuss the likely effects on calibration and on recall.
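Two of the options named in part D can be written down as per-example losses. The sketch below is a plain-Python illustration, not a framework implementation. It assumes `p` is the predicted probability of the positive class, uses the commonly cited focal-loss defaults γ = 2 and α = 0.25, and sets the hypothetical `pos_weight` to 99, the negative-to-positive ratio at 1% prevalence.

```python
import math


def bce(p, y, eps=1e-12):
    """Standard binary cross-entropy for one example; y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def weighted_bce(p, y, pos_weight=99.0, eps=1e-12):
    """Cross-entropy with positive examples up-weighted by pos_weight (assumed value)."""
    weight = pos_weight if y == 1 else 1.0
    return weight * bce(p, y, eps)


# A confidently correct positive contributes far less than a badly missed one.
easy_positive = focal_loss(0.9, 1)
hard_positive = focal_loss(0.1, 1)
print(f"focal: easy={easy_positive:.6f}, hard={hard_positive:.6f}")
print(f"weighted BCE on a missed positive: {weighted_bce(0.1, 1):.3f}")
```

Both losses typically raise recall at a fixed threshold by pushing positive scores up, but they also tend to distort the predicted probabilities, so a calibration step on held-out data at the true prevalence (or an explicitly tuned decision threshold) is usually worth discussing alongside them.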