Handle imbalance, sampling, and overfitting
Company: LinkedIn
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
# Machine Learning Fundamentals: Imbalance, Sampling, Overfitting, and Regularization
This data science technical screen asks four machine learning fundamentals questions, covered in Parts 1 through 4 below.
### Constraints & Assumptions
- Explain both practical steps and the reason each step matters.
- Tie evaluation metrics to the business cost of false positives and false negatives.
- Distinguish model-training fixes from evaluation-design fixes.
- For regularization, explain why a biased estimator can still generalize better.
### Clarifying Questions to Ask
- What is the positive class, and what is the cost of missing it?
- How imbalanced is the target, and are labels reliable?
- Is the sample drawn randomly, stratified, or based on a business rule?
- Which tree-based model are we using: a single tree, random forest, gradient boosting, or another ensemble?
### Part 1 - Class Imbalance
You are building a binary classifier with a highly imbalanced target. How would you handle the imbalance during training, and how would you evaluate the model?
#### What This Part Should Cover
- Avoid accuracy as the only metric.
- Use class weights, threshold tuning, undersampling, oversampling, synthetic sampling where appropriate, or anomaly-detection framing for rare events.
- Evaluate with precision, recall, F1, PR-AUC, confusion matrix, calibration, and business-cost curves.
- Keep train/test splits representative, and prevent leakage by resampling only the training data after the split is made.
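One way to tie the threshold to business cost, as the bullets above suggest, is to score every candidate threshold by its total misclassification cost. The sketch below is a minimal illustration in plain Python; the function names, the toy labels and scores, and the cost values are all hypothetical and not part of the original question.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, fn, tn) when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision and recall, defined as 0.0 when the denominator is empty."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(y_true, scores, cost_fp, cost_fn):
    """Return (threshold, cost) with the lowest total cost; ties keep the lowest threshold."""
    best = None
    for t in sorted(set(scores)):
        tp, fp, fn, tn = confusion_counts(y_true, scores, t)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


# Hypothetical toy data: 2 positives out of 10 rows.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.35, 0.60, 0.40, 0.90]

# At the default 0.5 cutoff, accuracy is 0.8 (the same as always predicting
# negative), yet precision and recall are both only 0.5.
tp, fp, fn, tn = confusion_counts(y_true, scores, 0.5)
print(precision_recall(tp, fp, fn))

# If a missed positive costs 10x a false alarm, a lower threshold wins.
print(pick_threshold(y_true, scores, cost_fp=1, cost_fn=10))
# If false alarms are the expensive error, the chosen threshold moves up.
print(pick_threshold(y_true, scores, cost_fp=10, cost_fn=1))
```

In practice the threshold would be chosen on a validation split and then confirmed on an untouched test split, so that the cost estimate itself is not overfit.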
### Part 2 - Sampling From a Huge Dataset
The full dataset is too large to train on directly, so you train using a sample. How would you verify that the sample is representative of the full dataset and that the resulting model generalizes well to the full population?
#### What This Part Should Cover
- Compare feature, label, segment, time, and geography distributions between sample and full population.
- Use stratified or weighted sampling when important groups are rare.
- Hold out an unbiased evaluation set from the full population if possible.
- Check performance stability across multiple samples or bootstrap draws.
- Monitor whether the model underperforms on underrepresented segments.
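The distribution comparison above can be summarized with a single drift score per column. The sketch below uses the population stability index (PSI) on a categorical segment column. PSI is one possible choice and is not named in the original question; the `eps` floor for empty categories and the toy region counts are assumptions made for illustration.

```python
import math
from collections import Counter


def proportions(values, categories):
    """Share of each category in `values` (0.0 for categories that never appear)."""
    counts = Counter(values)
    n = len(values)
    return {c: counts.get(c, 0) / n for c in categories}


def population_stability_index(population, sample, eps=1e-6):
    """PSI = sum over categories of (s - p) * ln(s / p).

    p and s are the category shares in the population and the sample.
    Shares are floored at `eps` so a missing category does not produce log(0).
    """
    categories = sorted(set(population) | set(sample))
    p = proportions(population, categories)
    s = proportions(sample, categories)
    psi = 0.0
    for c in categories:
        pc = max(p[c], eps)
        sc = max(s[c], eps)
        psi += (sc - pc) * math.log(sc / pc)
    return psi


# Hypothetical segment column: region of each row.
population = ["US"] * 700 + ["EU"] * 250 + ["APAC"] * 50
good_sample = ["US"] * 70 + ["EU"] * 25 + ["APAC"] * 5   # same shares as the population
skewed_sample = ["US"] * 95 + ["EU"] * 5                  # APAC is missing entirely

print(population_stability_index(population, good_sample))
print(population_stability_index(population, skewed_sample))
```

A PSI of zero means identical shares. A common rule of thumb treats values above roughly 0.25 as a large shift, but that cutoff is a convention rather than a statistical test, and numeric features would need to be binned before the same check applies.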
### Part 3 - Tree-Based Overfitting
You are using a tree-based model. How would you prevent overfitting?
#### What This Part Should Cover
- For single trees: max depth, min samples per leaf, min impurity decrease, pruning, and feature constraints.
- For ensembles: bagging, random forests, shrinkage (a small learning rate), row and feature subsampling, early stopping, and validation-based hyperparameter tuning.
- Use cross-validation, time-aware validation when needed, and simpler features when leakage or noise is suspected.
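To make shrinkage and early stopping concrete, the sketch below hand-rolls squared-error boosting with depth-1 trees (stumps) on a single feature and stops adding trees once validation loss stalls. It is a toy written in plain Python under stated assumptions, not how a production library implements boosting; the function names, the `patience` rule, and the synthetic step-function data are all hypothetical.

```python
import random


def fit_stump(x, residuals):
    """Fit a depth-1 regression tree (stump) on one feature by least squares."""
    best = None
    for split in sorted(set(x))[1:]:  # skipping the minimum keeps both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi < split]
        right = [r for xi, r in zip(x, residuals) if xi >= split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    return best[1:]  # (split, left_mean, right_mean)


def stump_output(stump, xi):
    split, left_mean, right_mean = stump
    return left_mean if xi < split else right_mean


def add_stump(preds, stump, learning_rate, x):
    """Shrinkage: each new stump contributes only learning_rate times its output."""
    return [p + learning_rate * stump_output(stump, xi) for p, xi in zip(preds, x)]


def predict(stumps, learning_rate, x):
    preds = [0.0] * len(x)
    for stump in stumps:
        preds = add_stump(preds, stump, learning_rate, x)
    return preds


def mse(y, preds):
    return sum((a - b) ** 2 for a, b in zip(y, preds)) / len(y)


def boost(x_tr, y_tr, x_val, y_val, learning_rate=0.1, max_rounds=200, patience=10):
    """Boost on training residuals; keep only the rounds up to the best validation loss."""
    stumps, best_val, best_round = [], float("inf"), 0
    train_pred = [0.0] * len(x_tr)
    val_pred = [0.0] * len(x_val)
    for round_idx in range(1, max_rounds + 1):
        residuals = [y - p for y, p in zip(y_tr, train_pred)]
        stump = fit_stump(x_tr, residuals)
        stumps.append(stump)
        train_pred = add_stump(train_pred, stump, learning_rate, x_tr)
        val_pred = add_stump(val_pred, stump, learning_rate, x_val)
        val_loss = mse(y_val, val_pred)
        if val_loss < best_val:
            best_val, best_round = val_loss, round_idx
        elif round_idx - best_round >= patience:
            break  # validation loss has not improved for `patience` rounds
    return stumps[:best_round], best_val


# Hypothetical toy data: a noisy step function on one feature.
rng = random.Random(0)


def make_data(n):
    x = [rng.random() for _ in range(n)]
    y = [(1.0 if xi > 0.5 else 0.0) + rng.gauss(0.0, 0.3) for xi in x]
    return x, y


x_tr, y_tr = make_data(40)
x_val, y_val = make_data(40)
stumps, best_val = boost(x_tr, y_tr, x_val, y_val)
print(len(stumps), best_val)
```

Training loss keeps falling with every round, so the number of trees acts as a capacity knob; choosing it from validation loss caps model complexity before the later trees start fitting noise. The single-tree controls listed above (depth, leaf size, pruning) are not shown here.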
### Part 4 - Bias From L1 and L2 Regularization
Why are L1- and L2-regularized estimators biased, and why can they still outperform an unbiased estimator on out-of-sample prediction?
#### What This Part Should Cover
- L1 and L2 add penalties that shrink coefficients toward zero.
- This shrinkage changes the expected coefficient estimates, so the estimators are generally biased.
- L1 can set coefficients exactly to zero; L2 shrinks them smoothly.
- The added bias can reduce variance by more than the squared bias it introduces, which lowers expected out-of-sample error.
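The trade-off in the bullets above can be worked out exactly in the simplest case: one feature, no intercept, fixed inputs, and `y = beta * x + noise` with noise variance `sigma^2`. Under those assumptions (which are chosen here for illustration, not taken from the original question), the sketch below computes the ridge estimator's bias, variance, and mean squared error in closed form, and contrasts ridge shrinkage with the lasso's soft-thresholding. The penalty scaling (`lam * b^2` and `lam * |b|` added to the raw sum of squares) is also an assumption; libraries often scale it differently.

```python
def ridge_1d(s_xy, s_xx, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2, with s_xy = sum(x*y), s_xx = sum(x^2)."""
    return s_xy / (s_xx + lam)


def lasso_1d(s_xy, s_xx, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b|: soft-thresholding of s_xy by lam / 2."""
    magnitude = max(abs(s_xy) - lam / 2.0, 0.0)
    sign = 1.0 if s_xy > 0 else -1.0
    return sign * magnitude / s_xx


def ridge_error_terms(beta, sigma, s_xx, lam):
    """Bias, variance, and MSE of ridge_1d around the true coefficient beta."""
    shrink = s_xx / (s_xx + lam)                      # E[ridge estimate] = shrink * beta
    bias = (shrink - 1.0) * beta                      # zero only when lam == 0 (OLS)
    variance = sigma ** 2 * s_xx / (s_xx + lam) ** 2  # falls as lam grows
    return bias, variance, bias ** 2 + variance


# Hypothetical numbers: weak signal (beta = 0.5), noisy data (sigma = 2), s_xx = 10.
for lam in (0.0, 4.0, 16.0, 64.0):
    bias, variance, mse = ridge_error_terms(0.5, 2.0, 10.0, lam)
    print(f"lam={lam:5.1f}  bias={bias:+.3f}  variance={variance:.3f}  mse={mse:.3f}")

# Same data summary, same penalty strength: L2 shrinks, L1 can hit exactly zero.
print(ridge_1d(3.0, 10.0, 8.0), lasso_1d(3.0, 10.0, 8.0))
```

With these numbers the unbiased estimator (`lam = 0`) has MSE `sigma^2 / s_xx = 0.4`, while `lam = 16` has a nonzero bias but an MSE of `4 / 26`, about 0.154. In this setup the MSE-minimizing penalty is `sigma^2 / beta^2`, so shrinkage helps most when the signal is weak relative to the noise.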
### What a Strong Answer Covers
A strong answer connects model choices to data distribution, evaluation design, leakage prevention, segment performance, and the bias-variance trade-off. It should be practical enough to guide a real model review.
### Follow-up Questions
- Why is PR-AUC often more informative than ROC-AUC for rare positives?
- How would you prevent oversampling leakage?
- What would you do if the model trained on the sample performs well overall but poorly for a small segment?
- How does early stopping regularize boosted trees?
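For the oversampling-leakage follow-up, one hedged sketch of the usual answer is to split first and oversample only the training side, so no duplicated minority row can appear in both splits and the test set keeps the real class ratio. The function name, the `(row_id, label)` row format, and the toy data below are hypothetical.

```python
import random


def stratified_split_then_oversample(rows, test_fraction=0.25, seed=0):
    """rows: list of (row_id, label) with labels 0/1.

    Returns (balanced_train, untouched_test). The split is stratified by label,
    and random oversampling with replacement is applied to the training rows only.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    pos = [r for r in train if r[1] == 1]
    neg = [r for r in train if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if not minority:
        raise ValueError("training split has no minority-class rows to oversample")
    # Duplicate minority rows until the two classes are the same size in training.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra, test


# Hypothetical toy data: 8 positives among 100 rows.
rows = [(i, 1 if i < 8 else 0) for i in range(100)]
train, test = stratified_split_then_oversample(rows)

train_ids = {row_id for row_id, _ in train}
test_ids = {row_id for row_id, _ in test}
print(len(train), len(test), train_ids.isdisjoint(test_ids))
```

Oversampling before the split would let copies of the same minority row land on both sides, which inflates test metrics. Inside cross-validation, the same rule means resampling within each training fold rather than once up front.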
Quick Answer: This LinkedIn data scientist machine learning screen covers class imbalance, representative sampling from huge datasets, overfitting controls for tree-based models, leakage-safe evaluation, and why the bias introduced by L1/L2 regularization can improve out-of-sample prediction.