Explain the train–test generalization gap
Company: Bytedance
Role: Data Scientist
Category: Machine Learning
Difficulty: easy
Interview Round: Technical Screen
A model performs very well on the training set but much worse on a held-out test set. Explain why this can happen and how you would diagnose and fix it.
Answer the question for both:
1. **Classical machine learning models** such as linear models, tree-based models, gradient boosting, or SVMs.
2. **Deep learning models** such as neural networks for tabular, text, image, or sequence data.
In your answer, discuss:
- The difference between overfitting, data leakage, and train-test distribution shift.
- How to use training, validation, and test metrics to tell these issues apart.
- Common root causes such as high model complexity, small sample size, noisy labels, poor feature design, duplicate records, non-representative splits, class imbalance, temporal leakage, and metric mismatch.
- Practical remedies for each setting, including regularization, cross-validation, better splitting strategy, feature pruning, hyperparameter tuning, early stopping, dropout, weight decay, data augmentation, transfer learning, and collecting more data.
- What additional checks you would run before concluding that the model is simply overfitting.
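The diagnostic steps above can be sketched in code. The toy example below (synthetic data, a deliberately unconstrained random forest, arbitrary sizes and seeds chosen for illustration) shows two of the checks: comparing train, cross-validation, and test accuracy to detect overfitting, and scanning for duplicate records shared across splits, which would signal leakage rather than overfitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A deep, unconstrained forest tends to memorize the training set.
model = RandomForestClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)

# Cross-validation on the training set estimates generalization without
# touching the test set; a large train-vs-CV gap points to overfitting,
# while a large CV-vs-test gap points to leakage or distribution shift.
cv_acc = cross_val_score(
    RandomForestClassifier(random_state=0), X_tr, y_tr, cv=5
).mean()
print(f"train={train_acc:.2f}  cv={cv_acc:.2f}  test={test_acc:.2f}")

# Leakage check: exact duplicate rows shared between train and test
# splits would inflate the test score.
tr_rows = {tuple(row) for row in X_tr.round(6)}
n_dupes = sum(tuple(row) in tr_rows for row in X_te.round(6))
print("test rows duplicated in train:", n_dupes)
```

In a real project the same comparison would be run with the production metric, and the duplicate scan would use business keys (user IDs, timestamps) rather than raw feature values.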
Quick Answer: A large train–test gap usually comes from one of three problems: overfitting (the model memorizes training noise), data leakage (information from the test set or the future reaches training), or train–test distribution shift (test data differ systematically from training data). Comparing training, validation, and test metrics distinguishes them, and the remedy must match the cause: regularization, early stopping, or more data for overfitting; a corrected splitting strategy for leakage; and more representative data for shift. These diagnostics apply to both classical models and deep learning, though the specific tools (e.g., tree depth limits vs. dropout and weight decay) differ.
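As a sketch of the remedies side, the snippet below contrasts an overparameterized network with the same architecture trained with weight decay and early stopping. It uses scikit-learn's `MLPClassifier` as a stand-in for a deep learning framework; the layer sizes, `alpha` value, and data are illustrative assumptions, not a recommended configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic noisy tabular data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 30))
y = (X[:, :3].sum(axis=1) + rng.normal(size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Overparameterized net with no explicit regularization.
plain = MLPClassifier(hidden_layer_sizes=(256, 256), alpha=0.0,
                      max_iter=2000, random_state=1).fit(X_tr, y_tr)

# Same capacity, but with L2 weight decay (alpha) and early stopping
# monitored on a held-out validation fraction of the training data.
reg = MLPClassifier(hidden_layer_sizes=(256, 256), alpha=1e-2,
                    early_stopping=True, validation_fraction=0.2,
                    max_iter=2000, random_state=1).fit(X_tr, y_tr)

gaps = {}
for name, m in [("plain", plain), ("regularized", reg)]:
    gaps[name] = m.score(X_tr, y_tr) - m.score(X_te, y_te)
    print(f"{name}: train-test accuracy gap = {gaps[name]:.2f}")
```

The same pattern carries over to PyTorch or TensorFlow via optimizer weight-decay settings, dropout layers, and validation-loss-based early stopping; for classical models the analogue is shrinking capacity directly (tree depth, L1/L2 penalties on linear models).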