PracHub

Explain logistic regression vs forests and boosting

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's mastery of supervised learning: logistic regression's probabilistic model and convex optimization, the effects of L1/L2/elastic-net regularization, and ensemble methods such as random forests and gradient boosting, along with practical skills in diagnostics, probability calibration, and end-to-end pipeline design. It is commonly asked in Machine Learning interviews for data scientist roles because it probes both theoretical foundations (loss functions, gradients, the bias–variance trade-off) and practical judgment (high-dimensional sparse or imbalanced data, hyperparameter sensitivity, time-based evaluation).

  • hard
  • Google
  • Machine Learning
  • Data Scientist


Company: Google

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen



Related Interview Questions

  • Explain ranking cold-start strategies - Google (medium)
  • Explain LLM fine-tuning and generative models - Google (medium)
  • Compare NLP tokenization and LLM recommendations - Google (medium)
  • Explain LLM lifecycle and trade-offs - Google (medium)
  • Build a bigram next-word predictor with weighted sampling - Google (medium)
Google · Data Scientist · Technical Screen · Machine Learning · Oct 13, 2025

Technical Screen — Machine Learning

Answer all parts precisely.

1) Binary logistic regression: model, loss, gradient, convexity

  • Define the model: p(y=1 | x) = σ(w · x + b).
  • Derive the negative log-likelihood (log-loss) and its gradient with respect to w and b.
  • Explain why the loss is convex and the implications for optimization.
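The derivation asked for in part 1 can be sanity-checked numerically. A minimal NumPy sketch (function and variable names are ours, not part of the question) computes the negative log-likelihood and its gradients, then verifies one gradient component with a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_and_grad(w, b, X, y):
    """Negative log-likelihood for p(y=1|x) = sigmoid(w.x + b).

    NLL     = -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    dNLL/dw = X^T (p - y)
    dNLL/db = sum_i (p_i - y_i)
    """
    p = sigmoid(X @ w + b)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll, X.T @ (p - y), np.sum(p - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w, b = rng.normal(size=3), 0.1

nll, grad_w, grad_b = nll_and_grad(w, b, X, y)

# finite-difference check of dNLL/dw_0
h = 1e-6
w_h = w.copy()
w_h[0] += h
numeric = (nll_and_grad(w_h, b, X, y)[0] - nll) / h
```

The Hessian is X^T diag(p(1-p)) X, which is positive semidefinite, so the loss is convex: gradient-based solvers converge to a global optimum rather than a local one.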

2) L1 vs L2 regularization in logistic regression

  • Compare their effects on:
    • Sparsity
    • Handling of multicollinearity
    • Margin geometry
    • Probability calibration
  • When would you pick elastic net over pure L1 or L2?
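To make the sparsity trade-off concrete, a scikit-learn sketch (dataset and hyperparameters are illustrative) fits the three penalties side by side; counting near-zero coefficients exposes L1's feature-selection effect:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=30, n_informative=5,
                           random_state=0)

# L1: the non-smooth penalty drives many coefficients exactly to zero
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
# L2: shrinks correlated coefficients toward each other, rarely to zero
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)
# Elastic net: l1_ratio blends both; needs the saga solver in scikit-learn
en = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5,
                        C=0.1, max_iter=5000).fit(X, y)

def n_zero(model):
    # count coefficients that the penalty pushed to (numerically) zero
    return int(np.sum(np.abs(model.coef_) < 1e-8))
```

Elastic net is the usual pick when features are both numerous and correlated: pure L1 tends to keep one feature from a correlated group arbitrarily, while the L2 component spreads weight across the group.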

3) When logistic regression can outperform a random forest

Discuss conditions such as:

  • (a) Truly linear or near-linear decision boundaries with limited interactions
  • (b) High-dimensional sparse binary features (e.g., text)
  • (c) Small-n, large-p regimes where strong regularization helps
  • (d) When calibrated probabilities and interpretability are priorities
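Condition (b) is easy to demonstrate on synthetic data: when the logit is purely linear in many weak binary features, a linear model recovers the pooled signal while axis-aligned trees, which see only a feature subset per split, struggle. A sketch, with all sizes and parameters illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, d, k = 2000, 500, 100                 # samples, features, informative features
X = (rng.random((n, d)) < 0.1).astype(np.float32)   # sparse binary features
w = np.zeros(d)
w[:k] = rng.choice([-1.0, 1.0], size=k)  # purely linear logit, no interactions
p = 1.0 / (1.0 + np.exp(-(X @ w)))
y = (rng.random(n) < p).astype(int)
X_tr, X_te, y_tr, y_te = X[:1500], X[1500:], y[:1500], y[1500:]

lr = LogisticRegression(C=1.0, max_iter=2000).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

lr_auc = roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1])
rf_auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
```

Each individual feature carries little signal, so any split gains little impurity reduction; the linear model aggregates all of them in one dot product.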

4) Remedies for overfitting and diagnostics beyond accuracy

  • Logistic regression: regularization strength, feature selection, class weighting, calibration, proper cross-validation.
  • Random forests: increase trees, limit depth, max_features, min_samples_* settings, OOB validation.
  • Boosting: learning rate, number of estimators, max_depth/leaf-wise growth, subsampling, early stopping.
  • How to detect overfitting beyond accuracy (e.g., calibration curves, PR-AUC vs ROC-AUC, decision boundary checks).

5) Random forests vs gradient boosting

Contrast them on:

  • Bias–variance characteristics
  • Robustness to noisy features
  • Sensitivity to hyperparameters
  • Ability to capture monotonic constraints
  • Handling missing values natively
  • Training and inference cost

Provide one real-world scenario where each clearly dominates the other, and justify it.

6) Case study: Imbalanced, sparse, drifting data

Data: 50k rows, 10k sparse binary features, class imbalance (1% positive), strong temporal drift. Propose end-to-end pipelines for:

  • (a) Logistic regression with elastic net
  • (b) A tree-based method (choose RF or GBDT)

Cover: feature processing, regularization/hyperparameters, evaluation protocol (time-based CV), threshold selection, probability calibration, and how to compare the models fairly. State pitfalls to avoid (e.g., leakage via target encoding, improper scaling of sparse inputs).
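A compressed sketch of pipeline (a), scaled down and with every size and setting illustrative: binary sparse features are left unscaled (centering would densify them), elastic-net logistic regression handles the imbalance via class weighting, and forward-chained time splits ensure the model is always validated on later data than it was trained on:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import TimeSeriesSplit

# scaled-down stand-in for 50k rows x 10k sparse binary features,
# with a low positive rate; rows are assumed to be in time order
rng = np.random.default_rng(0)
n, d = 5000, 500
X = sparse_random(n, d, density=0.02, format="csr", random_state=0)
X.data[:] = 1.0                       # binarize while keeping the matrix sparse
w_true = np.zeros(d)
w_true[:20] = 3.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ w_true - 6.0)))).astype(int)

# forward-chained splits: always train on the past, validate on the future
aps = []
for tr, va in TimeSeriesSplit(n_splits=3).split(X):
    clf = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5,
                             C=1.0, class_weight="balanced", max_iter=2000)
    clf.fit(X[tr], y[tr])
    # PR-AUC (average precision) is more informative than ROC-AUC at rare
    # positives; the decision threshold would be picked on this PR curve
    aps.append(average_precision_score(y[va], clf.predict_proba(X[va])[:, 1]))
```

Note that `class_weight="balanced"` distorts predicted probabilities, so a final calibration step (Platt scaling or isotonic regression, fit on a held-out later time slice) should precede any probability-based decisions.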

