This question evaluates competencies in handling class imbalance, choosing and interpreting evaluation metrics and decision thresholds, validating sample representativeness and model generalization when training on very large datasets, mitigating overfitting in decision-tree and ensemble models, and understanding how L1/L2 regularization introduces bias, all within the Machine Learning domain for Data Scientist roles. It is commonly asked to assess both practical skills (model validation, sampling, and hyperparameter controls) and conceptual understanding of bias-variance and regularization trade-offs, indicating readiness for production-grade supervised learning problems.
Answer the following applied ML questions.
You’re building a binary classifier where positives are rare. Which evaluation metrics would you use, how would you interpret them, and how would you choose a decision threshold?
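One illustrative sketch (not a required answer): with rare positives, accuracy is misleading, so a common approach is to sweep the decision threshold over model scores and pick the one that maximizes a precision/recall summary such as F1. The helpers below are hypothetical names written in pure Python for clarity.

```python
def pr_at_threshold(y_true, scores, thr):
    """Precision and recall when predicting positive for score >= thr."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < thr and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_f1_threshold(y_true, scores):
    """Sweep every observed score as a candidate threshold; keep the best F1."""
    best_thr, best_f1 = 0.5, -1.0
    for thr in sorted(set(scores)):
        p, r = pr_at_threshold(y_true, scores, thr)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1
```

In practice the threshold should be chosen on a validation set, and the metric can be swapped for a cost-weighted objective when false positives and false negatives have different business costs.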
You have an extremely large dataset, so you train on a sample. How would you validate that the sample is representative and that the trained model generalizes to the full population?
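As one possible piece of an answer: stratified sampling preserves the class proportions of the full dataset, which matters especially when positives are rare. A minimal pure-Python sketch (function name and signature are illustrative, not from any library):

```python
import random

def stratified_sample(rows, label_fn, frac, seed=0):
    """Draw roughly `frac` of each class so class rates match the population."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_fn(row), []).append(row)
    sample = []
    for label, group in sorted(by_class.items()):
        k = max(1, round(len(group) * frac))  # keep at least one per class
        sample.extend(rng.sample(group, k))
    return sample
```

Representativeness can then be checked by comparing class rates and key feature distributions between sample and population, and generalization by evaluating on a held-out set drawn independently from the full data.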
For decision trees / gradient-boosted trees / random forests: how would you detect and mitigate overfitting?
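To make the overfitting controls concrete, here is a toy pure-Python decision tree on 1-D data (an illustrative sketch, not any library's implementation) exposing the two classic complexity knobs, `max_depth` and `min_leaf`. An unconstrained tree memorizes a noisy training label; a depth-limited tree recovers the true rule.

```python
def gini(ys):
    """Gini impurity of a list of 0/1 labels."""
    if not ys:
        return 0.0
    p = sum(ys) / len(ys)
    return 2.0 * p * (1.0 - p)

def fit_tree(points, max_depth, min_leaf, depth=0):
    """Greedy binary tree on (x, y) pairs; nodes are ('leaf', label) or
    ('split', threshold, left, right)."""
    ys = [y for _, y in points]
    maj = 1 if 2 * sum(ys) >= len(ys) else 0
    if depth >= max_depth or len(set(ys)) == 1 or len(points) < 2 * min_leaf:
        return ("leaf", maj)
    pts = sorted(points)
    n = len(pts)
    best_score, best_i = gini(ys), None
    for i in range(min_leaf, n - min_leaf + 1):
        left = [y for _, y in pts[:i]]
        right = [y for _, y in pts[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score - 1e-12:  # only split if impurity strictly drops
            best_score, best_i = score, i
    if best_i is None:
        return ("leaf", maj)
    thr = (pts[best_i - 1][0] + pts[best_i][0]) / 2.0
    return ("split", thr,
            fit_tree(pts[:best_i], max_depth, min_leaf, depth + 1),
            fit_tree(pts[best_i:], max_depth, min_leaf, depth + 1))

def predict(tree, x):
    while tree[0] == "split":
        tree = tree[2] if x < tree[1] else tree[3]
    return tree[1]

def accuracy(tree, points):
    return sum(predict(tree, x) == y for x, y in points) / len(points)
```

The same idea carries over to real libraries: shallower trees, larger leaf sizes, pruning, row/feature subsampling (random forests), and shrinkage plus early stopping (gradient boosting) all trade a little training accuracy for better held-out performance.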
Explain why adding L1 (lasso) or L2 (ridge) regularization introduces bias, and why it can still improve generalization.
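A small simulation can make the bias-variance trade-off tangible. The sketch below (pure Python, illustrative only) fits a one-dimensional no-intercept regression in closed form: OLS gives w = Σxy / Σx², while the L2-penalized (ridge) solution w = Σxy / (Σx² + λ) shrinks toward zero. Across repeated noise draws, the ridge estimate is biased (its average misses the true slope) but has lower variance.

```python
import random

def ols_1d(xs, ys):
    # Least-squares slope through the origin: w = sum(x*y) / sum(x^2)
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_1d(xs, ys, lam):
    # L2-penalized slope: minimizes sum (y - w*x)^2 + lam * w^2
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

rng = random.Random(0)
true_w = 2.0
xs = [i / 10 for i in range(1, 11)]
ols_est, ridge_est = [], []
for _ in range(500):  # repeated noisy samples from the same true model
    ys = [true_w * x + rng.gauss(0, 0.5) for x in xs]
    ols_est.append(ols_1d(xs, ys))
    ridge_est.append(ridge_1d(xs, ys, lam=1.0))

mean = lambda v: sum(v) / len(v)
var = lambda v: sum((a - mean(v)) ** 2 for a in v) / len(v)
```

Here the ridge estimates average below the true slope of 2.0 (bias toward zero) while their spread across noise draws is smaller than OLS's (lower variance); when the variance reduction outweighs the squared bias, test error improves. L1 behaves similarly but can shrink coefficients exactly to zero.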