
Implement random forest with OOB and imbalance

Last updated: Mar 29, 2026

Quick Overview

This question evaluates expertise in implementing and engineering a memory- and compute-efficient Random Forest for binary classification. It covers ensemble methods, CART/Gini impurity, OOB evaluation and calibration, class-imbalance handling, reproducibility, parallel and thread-safe system design, and streaming/warm-start model updates.


Company: Apple

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Onsite



Implement a Memory-Efficient Random Forest (Binary Classification) Under Constraints

You are asked to design and implement a Random Forest for binary classification under the following constraints. Assume a dataset with N = 200,000 rows and d = 100 features (a mix of numerical and high-cardinality categorical), and a total memory budget of 2 GB. Your design should be robust enough for a production environment and suitable for an onsite interview discussion.

Requirements

  1. Trees
  • Use CART with Gini impurity (a weighted-impurity sketch follows the question statement).
  • Hyperparameters: max_depth, min_samples_leaf.
  • Missing values: either surrogate splits or median/mode imputation.
  • Categorical splits: support directly (no one-hot encoding).
  2. Bagging and Feature Bagging
  • Bootstrap sampling per tree (bagging).
  • Feature subsampling per node with m_try = ⌊√d⌋.
  • Deterministic reproducibility with per-tree seeds (see the seeding sketch below).
  3. Out-of-Bag (OOB) Evaluation
  • Compute OOB ROC-AUC, PR-AUC, and a reliability diagram (see the OOB sketch below).
  • Calibrate class probabilities using OOB predictions: compare Platt scaling vs isotonic regression; justify when each is preferable (see the calibration sketch below).
  4. Severe Class Imbalance
  • Positive rate ≈ 1%.
  • Incorporate a 10× misclassification cost for FN vs FP into both split selection and sampling (e.g., class_weight or stratified bootstrap); the weighted-Gini sketch below shows one way to fold the cost into the split score.
  5. Parallelization and Systems Aspects
  • Design thread-safe data structures.
  • Quantify training and inference time complexity and memory usage.
  • Estimate wall-clock time on 8 cores (a back-of-envelope model follows below).
  6. Streaming / Warm-Start
  • Support adding trees over time without retraining existing trees (see the warm-start sketch below).
  • Discuss concept-drift detection using OOB metrics.
  7. Deliverables
  • Clear pseudocode for train(), predict_proba(), and OOB evaluation.
  • Justify all design choices.

Assume m_try = ⌊√100⌋ = 10. If information is missing (e.g., exact numeric vs categorical split), make minimal, explicit assumptions to complete the design.
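
The sketches below illustrate, in Python, one way several of these requirements could be approached; every constant and helper name not given in the problem statement is an assumption. First, cost-sensitive split scoring for requirements 1 and 4, assuming the 10× FN-vs-FP cost enters the criterion as a class weight on positives:

```python
import numpy as np

def weighted_gini(y, w_pos=10.0, w_neg=1.0):
    """Class-weighted Gini impurity for binary labels (1 = positive).

    Weighting positives 10x is one way to push the 10x FN cost into
    split selection; it is an assumption, not the only encoding."""
    if y.size == 0:
        return 0.0
    w = np.where(y == 1, w_pos, w_neg)     # per-sample class weights
    p = w[y == 1].sum() / w.sum()          # weighted positive fraction
    return 2.0 * p * (1.0 - p)             # binary Gini: 1 - p^2 - (1-p)^2

def split_impurity(y_left, y_right, w_pos=10.0, w_neg=1.0):
    """Weight-averaged impurity of a candidate split; lower is better."""
    wl = np.where(y_left == 1, w_pos, w_neg).sum()
    wr = np.where(y_right == 1, w_pos, w_neg).sum()
    return (wl * weighted_gini(y_left, w_pos, w_neg)
            + wr * weighted_gini(y_right, w_pos, w_neg)) / (wl + wr)
```

For the categorical splits, a known result for binary targets is that sorting categories by their (weighted) positive rate and scanning only the contiguous partitions yields the optimal binary split, so high-cardinality features never need one-hot encoding.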
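
For requirement 2, a sketch of deterministic per-tree bootstrapping and per-node feature bagging; seeding `default_rng` on the pair (base_seed, tree_id) keeps every tree reproducible regardless of training order (base_seed = 42 is arbitrary):

```python
import numpy as np

D = 100
M_TRY = int(np.floor(np.sqrt(D)))          # m_try = ⌊√100⌋ = 10

def bootstrap_split(n_rows, tree_id, base_seed=42):
    """Deterministic bootstrap for one tree; rows never drawn are OOB
    (about 36.8% of rows in expectation)."""
    rng = np.random.default_rng([base_seed, tree_id])
    in_bag = rng.integers(0, n_rows, size=n_rows)   # sample with replacement
    oob_mask = np.ones(n_rows, dtype=bool)
    oob_mask[in_bag] = False
    return in_bag, np.flatnonzero(oob_mask), rng

def features_for_node(rng, d=D, m_try=M_TRY):
    """Feature bagging: draw a fresh m_try-subset at every node."""
    return rng.choice(d, size=m_try, replace=False)
```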
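
For requirement 3, OOB aggregation and the two ranking metrics, assuming each fitted tree exposes a scikit-learn-style predict_proba returning an (n, 2) array:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def oob_evaluate(trees, oob_index_lists, X, y):
    """Average each tree's positive-class score over the rows that were
    out of bag for that tree, then score only rows with >= 1 OOB vote."""
    n = X.shape[0]
    score_sum, votes = np.zeros(n), np.zeros(n)
    for tree, oob_idx in zip(trees, oob_index_lists):
        score_sum[oob_idx] += tree.predict_proba(X[oob_idx])[:, 1]
        votes[oob_idx] += 1
    covered = votes > 0
    proba = score_sum[covered] / votes[covered]
    return {
        "roc_auc": roc_auc_score(y[covered], proba),
        "pr_auc": average_precision_score(y[covered], proba),  # average precision
        "oob_coverage": covered.mean(),
    }
```

The reliability diagram can be built from the same OOB probabilities, e.g. with sklearn.calibration.calibration_curve.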
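
Calibration on the OOB scores (also requirement 3). Platt scaling fits only two parameters, so it is usually safer when positives are scarce (≈ 2,000 OOB positives at a 1% rate); isotonic regression is nonparametric and preferable once there is enough data to support its step function. The arrays below are synthetic stand-ins for the real OOB outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
y_oob = (rng.random(20_000) < 0.01).astype(int)            # ~1% positives
oob_scores = np.clip(0.05 + 0.4 * y_oob
                     + 0.1 * rng.standard_normal(20_000), 0.0, 1.0)

# Platt scaling: a sigmoid fit to (score -> label).
platt = LogisticRegression().fit(oob_scores.reshape(-1, 1), y_oob)

# Isotonic regression: monotone step function; clip out-of-range scores.
iso = IsotonicRegression(out_of_bounds="clip").fit(oob_scores, y_oob)

new_scores = np.array([0.02, 0.30, 0.70])
print(platt.predict_proba(new_scores.reshape(-1, 1))[:, 1])
print(iso.predict(new_scores))
```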
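
For requirement 5, a back-of-envelope time and memory model; forest size, depth cap, leaf size, node record size, and per-core throughput are all assumptions made only to ground the arithmetic:

```python
N, d, m_try = 200_000, 100, 10
T, depth, min_samples_leaf = 300, 20, 50     # assumed hyperparameters

# Each tree level scans every in-bag row across m_try candidate features;
# with pre-binned (histogram) splits each (row, feature) touch is ~O(1).
touches = T * N * m_try * depth              # ≈ 1.2e10
rate_per_core = 5e7                          # assumed touches/s/core
train_s = touches / (rate_per_core * 8)      # ≈ 30 s on 8 cores

x_mb = N * d * 4 / 1e6                       # float32 features ≈ 80 MB
nodes = T * (2 * (N // min_samples_leaf) - 1)  # ≤ 2·leaves − 1 per tree
forest_mb = nodes * 32 / 1e6                 # assumed 32-byte node record ≈ 77 MB

print(f"train ≈ {train_s:.0f} s; features ≈ {x_mb:.0f} MB; "
      f"forest ≈ {forest_mb:.0f} MB (within the 2 GB budget)")
```

On thread safety: the feature matrix is read-only during training and each worker writes only its own tree's node arrays, so a thread-per-tree design needs no locks on the data; only the shared list of finished trees requires a synchronized append.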
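
For requirement 6, a warm-start interface in which existing trees are never refit; fit_tree is a hypothetical stand-in for the single-tree trainer built from the pieces above:

```python
import numpy as np

class WarmStartForest:
    """Grow the ensemble over time without retraining existing trees."""

    def __init__(self, base_seed=42):
        self.base_seed = base_seed
        self.trees = []        # fitted trees
        self.oob_lists = []    # per-tree OOB row indices

    def add_trees(self, X, y, n_new, fit_tree):
        # Tree ids continue from len(self.trees), so per-tree seeds
        # never repeat and the grown forest stays reproducible.
        start = len(self.trees)
        for tree_id in range(start, start + n_new):
            tree, oob_idx = fit_tree(X, y, tree_id, self.base_seed)
            self.trees.append(tree)
            self.oob_lists.append(oob_idx)

    def predict_proba(self, X):
        # Average positive-class scores over all trees.
        p = np.mean([t.predict_proba(X)[:, 1] for t in self.trees], axis=0)
        return np.column_stack([1.0 - p, p])
```

For drift detection, one option is to compare the OOB metrics of each newly added batch of trees against the forest's historical OOB baseline; a sustained drop (say, several standard errors below baseline) signals concept drift and a reason to retire or down-weight older trees.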
