
Handle highly imbalanced classification data

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's competency in handling highly imbalanced binary classification problems, including data splitting and leakage prevention, imbalance mitigation techniques, appropriate metric selection and threshold calibration, algorithm selection for scalability, robust validation, and deployment monitoring.



Company: Google

Role: Data Scientist

Category: Machine Learning

Difficulty: Medium

Interview Round: Technical Screen

You must build a binary classifier for fraud with a 0.2% positive rate and 10M rows × 500 features. Propose an end-to-end plan that covers:

  1) Data splitting with stratification and leakage prevention.
  2) Handling imbalance (class weights vs. focal loss, down/over-sampling, SMOTE variants, and when to use each).
  3) Appropriate metrics and why (PR curve, AUPRC, recall at fixed precision, cost-sensitive metrics) vs. why ROC-AUC is misleading.
  4) Threshold setting using cost matrices and calibration (Platt/Isotonic) and how you’d do post-deployment threshold tuning.
  5) Algorithm choices and justification (baseline logistic with class_weight, tree ensembles with balanced subsampling, anomaly detection fallback).
  6) Robust validation (time-based CV, group CV), data drift monitoring, and rejection rules for extreme edge cases.
  7) A brief pseudocode of the training/evaluation loop that scales to this dataset.
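To make the metric and threshold parts of the prompt concrete, here is a minimal sketch in plain Python; it is an illustration under stated assumptions, not a reference answer. The function names, the 90% recall / 1% false positive rate operating point, and the 1:50 cost ratio are all hypothetical and chosen only for the example. It shows three things: the precision implied by a given ROC operating point at a 0.2% positive rate, a step-wise average precision (one common estimate of AUPRC), and threshold selection from false-positive and false-negative costs.

```python
def precision_from_rates(prevalence, tpr, fpr):
    """Precision implied by a (TPR, FPR) operating point at a given positive rate."""
    tp = prevalence * tpr
    fp = (1.0 - prevalence) * fpr
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def average_precision(labels, scores):
    """Step-wise AUPRC: mean of precision@rank taken at each true positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            total += tp / rank
    return total / n_pos if n_pos else 0.0


def bayes_threshold(cost_fp, cost_fn):
    """Cost-minimizing cutoff on a *calibrated* probability.

    Flag as fraud when p * cost_fn > (1 - p) * cost_fp,
    i.e. when p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def min_cost_threshold(labels, scores, cost_fp, cost_fn):
    """Empirical cutoff (predict fraud if score >= t) with the lowest total cost.

    Works on uncalibrated scores; should be run on a held-out validation set.
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    cost = cost_fn * sum(labels)  # cutoff above every score: all fraud is missed
    best_t, best_cost = float("inf"), cost
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        # Consume every row tied at this score before evaluating the cutoff.
        while i < len(pairs) and pairs[i][0] == t:
            cost += cost_fp if pairs[i][1] == 0 else -cost_fn
            i += 1
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


if __name__ == "__main__":
    # Hypothetical operating point: 90% recall at a 1% false positive rate.
    print(round(precision_from_rates(0.002, 0.90, 0.01), 3))
    # Hypothetical costs: a missed fraud costs 50x a false alarm.
    print(round(bayes_threshold(1, 50), 4))
```

At a 0.2% positive rate, an operating point that looks strong on a ROC curve (90% recall, 1% false positive rate) still yields a precision of only about 15%, which is why precision-recall metrics are the more informative choice here. The closed-form cutoff in `bayes_threshold` is only meaningful once scores have been calibrated (for example with Platt scaling or isotonic regression); `min_cost_threshold` is the empirical alternative when they have not.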


Oct 13, 2025, 9:49 PM


