
Implement random forest with OOB and imbalance

Last updated: Mar 29, 2026

Quick Overview

This question evaluates expertise in implementing and engineering a memory- and compute-efficient Random Forest for binary classification. It covers ensemble methods, CART/Gini impurity, OOB evaluation and calibration, class-imbalance handling, reproducibility, parallel and thread-safe system design, and streaming/warm-start model updates.


Company: Apple

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Onsite



Implement a Memory-Efficient Random Forest (Binary Classification) Under Constraints

You are asked to design and implement a Random Forest for binary classification under the following constraints. Assume a dataset with N = 200,000 rows and d = 100 features (a mix of numerical and high-cardinality categorical), and a total memory budget of 2 GB. Your design should be robust enough for a production environment and suitable for an onsite interview discussion.

Requirements

  1. Trees
  • Use CART with Gini impurity (a weighted-impurity sketch follows the question statement).
  • Hyperparameters: max_depth, min_samples_leaf.
  • Missing values: either surrogate splits or median/mode imputation.
  • Categorical splits: support directly (no one-hot encoding).
  2. Bagging and Feature Bagging
  • Bootstrap sampling per tree (bagging).
  • Feature subsampling per node with m_try = ⌊√d⌋.
  • Deterministic reproducibility with per-tree seeds (see the seeding sketch below).
  3. Out-of-Bag (OOB) Evaluation
  • Compute OOB ROC-AUC, PR-AUC, and a reliability diagram (see the OOB sketch below).
  • Calibrate class probabilities using OOB predictions: compare Platt scaling vs isotonic regression; justify when each is preferable (see the calibration sketch below).
  4. Severe Class Imbalance
  • Positive rate ≈ 1%.
  • Incorporate a 10× misclassification cost for FN vs FP into both split selection and sampling (e.g., class_weight or stratified bootstrap); the weighted-Gini sketch below shows one way to fold the cost into the split score.
  5. Parallelization and Systems Aspects
  • Design thread-safe data structures.
  • Quantify training and inference time complexity and memory usage.
  • Estimate wall-clock time on 8 cores (a back-of-envelope model follows below).
  6. Streaming / Warm-Start
  • Support adding trees over time without retraining existing trees (see the warm-start sketch below).
  • Discuss concept-drift detection using OOB metrics.
  7. Deliverables
  • Clear pseudocode for train(), predict_proba(), and OOB evaluation.
  • Justify all design choices.

Assume m_try = ⌊√100⌋ = 10. If information is missing (e.g., exact numeric vs categorical split), make minimal, explicit assumptions to complete the design.
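
The sketches below illustrate, in Python, one way several of these requirements could be approached; every constant and helper name not given in the problem statement is an assumption. First, cost-sensitive split scoring for requirements 1 and 4, assuming the 10× FN-vs-FP cost enters the criterion as a class weight on positives:

```python
import numpy as np

def weighted_gini(y, w_pos=10.0, w_neg=1.0):
    """Class-weighted Gini impurity for binary labels (1 = positive).

    Weighting positives 10x is one way to push the 10x FN cost into
    split selection; it is an assumption, not the only encoding."""
    if y.size == 0:
        return 0.0
    w = np.where(y == 1, w_pos, w_neg)     # per-sample class weights
    p = w[y == 1].sum() / w.sum()          # weighted positive fraction
    return 2.0 * p * (1.0 - p)             # binary Gini: 1 - p^2 - (1-p)^2

def split_impurity(y_left, y_right, w_pos=10.0, w_neg=1.0):
    """Weight-averaged impurity of a candidate split; lower is better."""
    wl = np.where(y_left == 1, w_pos, w_neg).sum()
    wr = np.where(y_right == 1, w_pos, w_neg).sum()
    return (wl * weighted_gini(y_left, w_pos, w_neg)
            + wr * weighted_gini(y_right, w_pos, w_neg)) / (wl + wr)
```

For the categorical splits, a known result for binary targets is that sorting categories by their (weighted) positive rate and scanning only the contiguous partitions yields the optimal binary split, so high-cardinality features never need one-hot encoding.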
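
For requirement 2, a sketch of deterministic per-tree bootstrapping and per-node feature bagging; seeding `default_rng` on the pair (base_seed, tree_id) keeps every tree reproducible regardless of training order (base_seed = 42 is arbitrary):

```python
import numpy as np

D = 100
M_TRY = int(np.floor(np.sqrt(D)))          # m_try = ⌊√100⌋ = 10

def bootstrap_split(n_rows, tree_id, base_seed=42):
    """Deterministic bootstrap for one tree; rows never drawn are OOB
    (about 36.8% of rows in expectation)."""
    rng = np.random.default_rng([base_seed, tree_id])
    in_bag = rng.integers(0, n_rows, size=n_rows)   # sample with replacement
    oob_mask = np.ones(n_rows, dtype=bool)
    oob_mask[in_bag] = False
    return in_bag, np.flatnonzero(oob_mask), rng

def features_for_node(rng, d=D, m_try=M_TRY):
    """Feature bagging: draw a fresh m_try-subset at every node."""
    return rng.choice(d, size=m_try, replace=False)
```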
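
For requirement 3, OOB aggregation and the two ranking metrics, assuming each fitted tree exposes a scikit-learn-style predict_proba returning an (n, 2) array:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def oob_evaluate(trees, oob_index_lists, X, y):
    """Average each tree's positive-class score over the rows that were
    out of bag for that tree, then score only rows with >= 1 OOB vote."""
    n = X.shape[0]
    score_sum, votes = np.zeros(n), np.zeros(n)
    for tree, oob_idx in zip(trees, oob_index_lists):
        score_sum[oob_idx] += tree.predict_proba(X[oob_idx])[:, 1]
        votes[oob_idx] += 1
    covered = votes > 0
    proba = score_sum[covered] / votes[covered]
    return {
        "roc_auc": roc_auc_score(y[covered], proba),
        "pr_auc": average_precision_score(y[covered], proba),  # average precision
        "oob_coverage": covered.mean(),
    }
```

The reliability diagram can be built from the same OOB probabilities, e.g. with sklearn.calibration.calibration_curve.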
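
Calibration on the OOB scores (also requirement 3). Platt scaling fits only two parameters, so it is usually safer when positives are scarce (≈ 2,000 OOB positives at a 1% rate); isotonic regression is nonparametric and preferable once there is enough data to support its step function. The arrays below are synthetic stand-ins for the real OOB outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
y_oob = (rng.random(20_000) < 0.01).astype(int)            # ~1% positives
oob_scores = np.clip(0.05 + 0.4 * y_oob
                     + 0.1 * rng.standard_normal(20_000), 0.0, 1.0)

# Platt scaling: a sigmoid fit to (score -> label).
platt = LogisticRegression().fit(oob_scores.reshape(-1, 1), y_oob)

# Isotonic regression: monotone step function; clip out-of-range scores.
iso = IsotonicRegression(out_of_bounds="clip").fit(oob_scores, y_oob)

new_scores = np.array([0.02, 0.30, 0.70])
print(platt.predict_proba(new_scores.reshape(-1, 1))[:, 1])
print(iso.predict(new_scores))
```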
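
For requirement 5, a back-of-envelope time and memory model; forest size, depth cap, leaf size, node record size, and per-core throughput are all assumptions made only to ground the arithmetic:

```python
N, d, m_try = 200_000, 100, 10
T, depth, min_samples_leaf = 300, 20, 50     # assumed hyperparameters

# Each tree level scans every in-bag row across m_try candidate features;
# with pre-binned (histogram) splits each (row, feature) touch is ~O(1).
touches = T * N * m_try * depth              # ≈ 1.2e10
rate_per_core = 5e7                          # assumed touches/s/core
train_s = touches / (rate_per_core * 8)      # ≈ 30 s on 8 cores

x_mb = N * d * 4 / 1e6                       # float32 features ≈ 80 MB
nodes = T * (2 * (N // min_samples_leaf) - 1)  # ≤ 2·leaves − 1 per tree
forest_mb = nodes * 32 / 1e6                 # assumed 32-byte node record ≈ 77 MB

print(f"train ≈ {train_s:.0f} s; features ≈ {x_mb:.0f} MB; "
      f"forest ≈ {forest_mb:.0f} MB (within the 2 GB budget)")
```

On thread safety: the feature matrix is read-only during training and each worker writes only its own tree's node arrays, so a thread-per-tree design needs no locks on the data; only the shared list of finished trees requires a synchronized append.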
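
For requirement 6, a warm-start interface in which existing trees are never refit; fit_tree is a hypothetical stand-in for the single-tree trainer built from the pieces above:

```python
import numpy as np

class WarmStartForest:
    """Grow the ensemble over time without retraining existing trees."""

    def __init__(self, base_seed=42):
        self.base_seed = base_seed
        self.trees = []        # fitted trees
        self.oob_lists = []    # per-tree OOB row indices

    def add_trees(self, X, y, n_new, fit_tree):
        # Tree ids continue from len(self.trees), so per-tree seeds
        # never repeat and the grown forest stays reproducible.
        start = len(self.trees)
        for tree_id in range(start, start + n_new):
            tree, oob_idx = fit_tree(X, y, tree_id, self.base_seed)
            self.trees.append(tree)
            self.oob_lists.append(oob_idx)

    def predict_proba(self, X):
        # Average positive-class scores over all trees.
        p = np.mean([t.predict_proba(X)[:, 1] for t in self.trees], axis=0)
        return np.column_stack([1.0 - p, p])
```

For drift detection, one option is to compare the OOB metrics of each newly added batch of trees against the forest's historical OOB baseline; a sustained drop (say, several standard errors below baseline) signals concept drift and a reason to retire or down-weight older trees.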
