Practical ML questions (classification and generalization)
Answer the following ML engineering/data science questions.
A) Class imbalance
You’re training a classifier where the positive class is rare.
- How do you handle **class imbalance** (data-level and algorithm-level approaches)?
- Which **evaluation metrics** are appropriate and why (e.g., accuracy vs precision/recall/F1/ROC-AUC/PR-AUC)?
- What pitfalls should you watch for (e.g., calibration, thresholding, leakage)?
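A minimal sketch of one answer to the questions above: an algorithm-level fix (class weighting) evaluated with imbalance-aware metrics. The synthetic dataset, the logistic-regression model, and the 0.5 threshold are illustrative assumptions, not part of the questions.

```python
# Sketch: class weighting + PR-AUC on a synthetic imbalanced problem.
# All dataset/model choices here are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score, f1_score

# ~5% positives: plain accuracy would look great for an "always negative" model
X, y = make_classification(n_samples=5000, weights=[0.95], flip_y=0.01,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights the loss inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

pr_auc = average_precision_score(y_te, scores)  # threshold-free, imbalance-aware
f1 = f1_score(y_te, scores > 0.5)               # threshold-dependent
print(f"PR-AUC={pr_auc:.3f}  F1@0.5={f1:.3f}")
```

Note that PR-AUC's random baseline equals the positive prevalence (~0.05 here), which makes it far more informative than accuracy or even ROC-AUC when positives are rare; the 0.5 threshold used for F1 is exactly the kind of arbitrary choice the "thresholding" pitfall refers to.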
B) Training on a sample from a very large dataset
You train a model on a sample drawn from a massive dataset.
- How do you verify the **sample is representative**?
- How do you validate that a model trained on the sample will **generalize** to the full population?
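One common representativeness check for the first question above is to compare each feature's marginal distribution in the sample against the full dataset, e.g. with a two-sample Kolmogorov-Smirnov test. The synthetic "population" below is a stand-in assumption; in practice you would stream or subsample the real full dataset.

```python
# Sketch: per-feature KS test comparing a sample to the full dataset.
# The population array is a synthetic stand-in for a massive real dataset.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
population = rng.normal(size=(100_000, 3))                 # pretend full data
idx = rng.choice(100_000, size=5_000, replace=False)
sample = population[idx]                                    # uniform random sample

results = []
for j in range(population.shape[1]):
    stat, p = ks_2samp(population[:, j], sample[:, j])
    results.append((stat, p))
    # A large KS statistic (small p) flags a feature whose marginal
    # distribution differs between sample and population.
    print(f"feature {j}: KS stat={stat:.4f}, p={p:.3f}")
```

Marginal checks like this are necessary but not sufficient (they miss joint-distribution shifts); a complementary tactic is to hold out a slice of the full data that the sample never touched and confirm the sample-trained model's metrics on it match its cross-validation metrics.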
C) Preventing overfitting in tree-based models
For decision trees / random forests / gradient-boosted trees:
- What knobs and practices help prevent **overfitting**?
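The usual knobs are depth limits, minimum leaf sizes, feature subsampling, and (for boosting) learning rate and early stopping. A sketch comparing an unconstrained random forest to a constrained one on the train-test accuracy gap; the dataset and the specific hyperparameter values are illustrative, not tuned.

```python
# Sketch: regularization knobs for tree ensembles, shown via the
# train-test accuracy gap. Values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "unconstrained": RandomForestClassifier(random_state=0),
    "constrained": RandomForestClassifier(
        max_depth=6,          # cap tree depth
        min_samples_leaf=10,  # require enough samples per leaf
        max_features="sqrt",  # feature subsampling decorrelates trees
        n_estimators=200,
        random_state=0,
    ),
}

gaps = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    gaps[name] = m.score(X_tr, y_tr) - m.score(X_te, y_te)
    print(f"{name}: train-test gap = {gaps[name]:.3f}")
```

A shrinking gap under constraints is the signature of reduced overfitting; for gradient-boosted trees the analogous knobs would be `learning_rate`, `subsample`, and early stopping on a validation set.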
D) Why L1/L2 regularization is biased
Explain why L1 (Lasso) and L2 (Ridge) regularization typically produce biased coefficient estimates, and why we still use them.
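The bias can be demonstrated empirically: averaging fitted coefficients over many resampled datasets approximates the estimator's expectation, and the ridge average sits systematically below the true coefficient while OLS does not. The single-feature setup, the true coefficient of 2.0, and `alpha=50` are illustrative assumptions.

```python
# Sketch: ridge shrinkage bias. With one standardized feature, the ridge
# coefficient is roughly the OLS coefficient scaled by n/(n + alpha),
# so its expectation is pulled toward zero. Setup values are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
true_beta = 2.0
coefs_ols, coefs_ridge = [], []
for _ in range(500):  # repeat fits to estimate each estimator's expectation
    X = rng.normal(size=(100, 1))
    y = true_beta * X[:, 0] + rng.normal(size=100)
    coefs_ols.append(LinearRegression().fit(X, y).coef_[0])
    coefs_ridge.append(Ridge(alpha=50.0).fit(X, y).coef_[0])

mean_ols = np.mean(coefs_ols)      # ≈ true_beta: OLS is unbiased
mean_ridge = np.mean(coefs_ridge)  # systematically below true_beta: biased
print(f"mean OLS coef:   {mean_ols:.3f}")
print(f"mean ridge coef: {mean_ridge:.3f}")
```

This is the bias-variance trade-off in action: the ridge estimates are biased toward zero, but their variance across resamples is smaller, which is why the penalized estimator can still have lower expected prediction error, especially with correlated or high-dimensional features.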