Explain logistic regression vs. random forests and gradient boosting
Company: Google
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
Answer all parts precisely.
1) Define binary logistic regression: write the model p(y=1|x)=σ(w·x+b). Derive the negative log-likelihood (log-loss) and its gradient with respect to w and b. Explain why the loss is convex and the implications for optimization.
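A strong answer to part 1 can be checked numerically. The sketch below (synthetic data, assumed helper names `sigmoid`, `log_loss`, `gradients`) implements the mean negative log-likelihood and its analytic gradients, d(NLL)/dw = Xᵀ(p − y)/n and d(NLL)/db = mean(p − y), then verifies them against finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b, X, y):
    # Mean negative log-likelihood for p(y=1|x) = sigmoid(w.x + b).
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradients(w, b, X, y):
    # Analytic gradients: d/dw = X^T (p - y) / n,  d/db = mean(p - y).
    p = sigmoid(X @ w + b)
    return X.T @ (p - y) / len(y), np.mean(p - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)
w, b = rng.normal(size=3), 0.1

gw, gb = gradients(w, b, X, y)
eps = 1e-6
for j in range(3):
    e = np.zeros(3); e[j] = eps
    # Central finite difference should match the analytic gradient closely.
    num = (log_loss(w + e, b, X, y) - log_loss(w - e, b, X, y)) / (2 * eps)
    assert abs(num - gw[j]) < 1e-5
```

Because the Hessian Xᵀ diag(p(1−p)) X / n is positive semidefinite, the loss is convex, so gradient-based solvers converge to a global optimum.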
2) Compare L1 vs L2 regularization in logistic regression: effects on sparsity, handling of multicollinearity, margin geometry, and probability calibration. When would you pick elastic net over pure L1 or L2?
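The sparsity contrast in part 2 is easy to demonstrate. A minimal sketch on synthetic data (assuming scikit-learn's `saga` solver, which supports all three penalties) counts nonzero coefficients under each penalty:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 50 features, only 5 informative: L1 should zero out most of the rest,
# L2 shrinks coefficients toward zero but keeps all of them nonzero,
# elastic net interpolates between the two behaviors.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)
l1 = LogisticRegression(penalty="l1", solver="saga", C=0.1,
                        max_iter=5000).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="saga", C=0.1,
                        max_iter=5000).fit(X, y)
en = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5,
                        C=0.1, max_iter=5000).fit(X, y)

def nonzero(model):
    return int(np.sum(model.coef_ != 0))

print(nonzero(l1), nonzero(l2), nonzero(en))
```

Elastic net is often preferred when correlated feature groups exist: pure L1 tends to pick one feature per group arbitrarily, while the L2 component spreads weight across the group.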
3) Under what data conditions does logistic regression often outperform a random forest? Discuss cases such as (a) truly linear or near-linear decision boundaries with limited interactions, (b) high-dimensional sparse binary features (e.g., text), (c) small-n, large-p regimes where strong regularization helps, and (d) when calibrated probabilities and interpretability are priorities.
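Condition (b) can be illustrated with an assumed synthetic setup: sparse binary "text-like" features and a genuinely linear logit, the regime where logistic regression often edges out a default random forest:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 3000, 500
X = (rng.random((n, p)) < 0.02).astype(float)   # sparse binary features
beta = np.zeros(p); beta[:30] = 2.0             # linear ground-truth logit
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ beta - 1.2)))).astype(int)

X_tr, X_te, y_tr, y_te = X[:2000], X[2000:], y[:2000], y[2000:]
lr_auc = roc_auc_score(
    y_te,
    LogisticRegression(max_iter=2000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
rf_auc = roc_auc_score(
    y_te,
    RandomForestClassifier(n_estimators=200, random_state=0)
    .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
# LR estimates the linear model directly; RF must approximate it with
# axis-aligned splits over many rarely-active features.
print(round(lr_auc, 3), round(rf_auc, 3))
```

The gap is not guaranteed on every seed, but the mechanism is the point an interviewer wants: when the true boundary is linear and features are sparse, a well-regularized linear model matches the data-generating process.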
4) Your model overfits. List concrete remedies tailored to each method: for logistic regression (regularization strength, feature selection, class weighting, calibration, proper CV), for random forests (increase trees, limit depth, max_features, min_samples_* settings, OOB validation), and for boosting (learning rate, number of estimators, max_depth/leaf-wise growth, subsampling, early stopping). Include how you would detect overfitting beyond accuracy (e.g., calibration curves, PR-AUC vs ROC-AUC, decision boundary checks).
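Two of the boosting remedies above can be sketched together, assuming scikit-learn's `GradientBoostingClassifier` on synthetic data: early stopping via `n_iter_no_change`, plus a train/validation log-loss gap as an overfitting signal beyond accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

gbm = GradientBoostingClassifier(
    learning_rate=0.05,      # shrinkage: smaller steps, less overfitting
    n_estimators=2000,       # generous upper bound; early stopping picks the count
    max_depth=3,             # shallow trees limit interaction order
    subsample=0.8,           # stochastic boosting reduces variance
    validation_fraction=0.2,
    n_iter_no_change=20,     # stop once the internal validation score stalls
    random_state=0,
).fit(X_tr, y_tr)

# A large positive train-to-validation loss gap flags overfitting even
# when held-out accuracy still looks acceptable.
gap = (log_loss(y_va, gbm.predict_proba(X_va)) -
       log_loss(y_tr, gbm.predict_proba(X_tr)))
print(gbm.n_estimators_, round(gap, 3))
```

The same gap diagnostic applies to the other two model families; for calibration, `sklearn.calibration.calibration_curve` gives the reliability diagram the question asks about.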
5) Contrast random forests vs gradient boosting: bias–variance characteristics, robustness to noisy features, sensitivity to hyperparameters, ability to capture monotonic constraints, handling missing values natively, and training/inference cost. Give one real-world scenario where each clearly dominates the other and justify.
6) Case study: You have 50k rows, 10k sparse binary features, class imbalance 1% positive, and strong temporal drift. Propose an end-to-end pipeline for (a) logistic regression with elastic net and (b) a tree-based method (choose RF or GBDT). Cover feature processing, regularization/hyperparameters, evaluation protocol (time-based CV), threshold selection, probability calibration, and how you’d compare the models fairly. State the pitfalls you will avoid (e.g., leakage via target encoding, improper scaling of sparse inputs).
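A scaled-down sketch of pipeline (a) follows, with assumed synthetic data standing in for the 50k x 10k problem. It shows three of the required ingredients: a sparsity-preserving scaler (`MaxAbsScaler` does no centering, avoiding the sparse-scaling pitfall), elastic-net logistic regression with class weighting, and time-ordered CV via `TimeSeriesSplit` scored with average precision (the PR-AUC analogue suited to 1% positives); calibration and threshold selection would be fit on a further held-out time block:

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(0)
n, p = 5000, 1000                      # stand-in for 50k rows x 10k features
X = sparse.random(n, p, density=0.01, format="csr", random_state=0)
X.data[:] = 1.0                        # binary presence features
beta = np.zeros(p); beta[:20] = 3.0
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ beta - 4.0)))).astype(int)

clf = make_pipeline(
    MaxAbsScaler(),                    # scales sparse input without densifying
    LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5,
                       C=1.0, class_weight="balanced", max_iter=2000),
)

# Time-based CV: every fold trains strictly on the past, tests on the
# next contiguous block, so temporal drift and leakage are respected.
scores = []
for tr, te in TimeSeriesSplit(n_splits=4).split(X):
    clf.fit(X[tr], y[tr])
    scores.append(average_precision_score(y[te], clf.predict_proba(X[te])[:, 1]))
print([round(s, 3) for s in scores])
```

For a fair comparison, the tree-based candidate in part (b) would run through the identical split protocol and metric, with any target-derived features computed inside each training fold only.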
Quick Answer: This question evaluates a candidate's mastery of supervised learning: logistic regression's probabilistic model and convex optimization, the effects of L1/L2/elastic-net regularization, ensemble methods such as random forests and gradient boosting, and practical competence in diagnostics, probability calibration, and end-to-end pipeline design. It is commonly asked in Machine Learning interviews for data scientist roles because it probes both theoretical foundations (loss functions, gradients, bias-variance trade-offs) and practical judgment (high-dimensional sparse or imbalanced data, hyperparameter sensitivity, time-based evaluation), testing conceptual understanding and hands-on model selection together.