Implement a Memory-Efficient Random Forest (Binary Classification) Under Constraints
You are asked to design and implement a Random Forest for binary classification under the following constraints. Assume a dataset with N = 200,000 rows and d = 100 features (a mix of numerical and high-cardinality categorical), and a total memory budget of 2 GB. Your design should be robust enough for a production environment and suitable for an onsite interview discussion.
Requirements
Trees
- Use CART with Gini impurity (a weighted-Gini split sketch follows this list).
- Hyperparameters: max_depth, min_samples_leaf.
- Missing values: either surrogate splits or median/mode imputation.
- Categorical splits: support directly (no one-hot encoding).
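To make the split criterion concrete, here is a minimal weighted-Gini sketch in Python/NumPy. It assumes per-sample weights (so the 10× false-negative cost from the imbalance requirement can be folded in); the function names and the synthetic example are illustrative, not part of the spec.

```python
import numpy as np

def weighted_gini(y, w):
    """Gini impurity of one node under per-sample weights (binary labels 0/1)."""
    total = w.sum()
    if total == 0.0:
        return 0.0
    p_pos = w[y == 1].sum() / total
    return 1.0 - p_pos**2 - (1.0 - p_pos)**2

def split_impurity(y, w, go_left):
    """Weight-averaged Gini of the two children induced by the boolean mask go_left."""
    total = w.sum()
    left, right = go_left, ~go_left
    return (w[left].sum() / total) * weighted_gini(y[left], w[left]) \
         + (w[right].sum() / total) * weighted_gini(y[right], w[right])

# Illustrative numeric-threshold split on one feature; positives are roughly 1% of rows.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = (x + rng.normal(scale=0.5, size=10_000) > 2.6).astype(int)
w = np.where(y == 1, 10.0, 1.0)                  # 10x cost on missing a positive
print(split_impurity(y, w, go_left=(x <= 2.0)))
```

For binary classification, native categorical splits can stay tractable via the standard trick of sorting categories by their (weighted) positive rate and scanning only contiguous prefixes of that ordering, which avoids both the exponential subset search and one-hot encoding.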
Bagging and Feature Bagging
- Bootstrap sampling per tree (bagging).
- Feature subsampling per node with m_try = ⌊√d⌋.
- Deterministic reproducibility with per-tree seeds (a seeding sketch follows this list).
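One way to get deterministic bagging is to derive each tree's RNG from a single base seed; the sketch below assumes that scheme (names such as BASE_SEED and bootstrap_indices are illustrative).

```python
import numpy as np

M_TRY = 10          # floor(sqrt(100)) candidate features per node
BASE_SEED = 1234    # single global seed; each tree's RNG depends only on (BASE_SEED, tree_id)

def bootstrap_indices(n_rows, tree_id):
    """Per-tree bootstrap sample plus the implied out-of-bag rows (~36.8% of the data)."""
    rng = np.random.default_rng(BASE_SEED + tree_id)
    in_bag = rng.integers(0, n_rows, size=n_rows)    # sample rows with replacement
    oob = np.ones(n_rows, dtype=bool)
    oob[in_bag] = False                              # rows never drawn are out-of-bag
    return in_bag, np.flatnonzero(oob), rng

def candidate_features(rng, n_features, m_try=M_TRY):
    """Feature subsample drawn independently at every node from the tree's own RNG."""
    return rng.choice(n_features, size=m_try, replace=False)

in_bag, oob, rng = bootstrap_indices(n_rows=200_000, tree_id=7)
print(len(in_bag), len(oob), sorted(candidate_features(rng, n_features=100)))
```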
Out-of-Bag (OOB) Evaluation
- Compute OOB ROC-AUC, PR-AUC, and a reliability diagram.
- Calibrate class probabilities using OOB predictions: compare Platt scaling vs. isotonic regression and justify when each is preferable (a metric-and-calibration sketch follows this list).
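A sketch of the OOB scoring and calibration step, assuming the per-row OOB probabilities have already been aggregated across trees; the scikit-learn calls are standard, and the synthetic scores at the end only stand in for real forest output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.linear_model import LogisticRegression     # Platt scaling = 1-D logistic fit
from sklearn.isotonic import IsotonicRegression         # monotone, non-parametric calibration
from sklearn.calibration import calibration_curve

def oob_report(y, oob_proba):
    """Rank metrics plus both calibrators fitted on the OOB scores."""
    print("OOB ROC-AUC:", roc_auc_score(y, oob_proba))
    print("OOB PR-AUC :", average_precision_score(y, oob_proba))
    # Platt scaling: two parameters, robust when OOB positives are scarce.
    platt = LogisticRegression().fit(oob_proba.reshape(-1, 1), y)
    # Isotonic regression: more flexible, can overfit very small positive counts.
    iso = IsotonicRegression(out_of_bounds="clip").fit(oob_proba, y)
    # Reliability-diagram data: observed positive rate vs. mean predicted score per bin.
    frac_pos, mean_pred = calibration_curve(y, oob_proba, n_bins=10)
    return platt, iso, (frac_pos, mean_pred)

# Synthetic stand-in for aggregated OOB output (positive rate ~1%).
rng = np.random.default_rng(0)
y = (rng.random(20_000) < 0.01).astype(int)
scores = np.clip(0.05 + 0.4 * y + rng.normal(scale=0.1, size=len(y)), 0.0, 1.0)
oob_report(y, scores)
```

The usual rule of thumb: Platt scaling is the safer choice when OOB positives are few, while isotonic regression is preferable once there is enough data to support its non-parametric, monotone fit.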
Severe Class Imbalance
- Positive rate ≈ 1%.
- Incorporate a 10× misclassification cost for FN vs. FP into both split selection and sampling (e.g., class_weight or a stratified bootstrap; see the sketch after this list).
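One possible (not the only) way to satisfy both halves of this requirement: per-sample weights that feed the weighted Gini and the leaf probability estimates, plus a stratified bootstrap so each tree's in-bag sample keeps the rare class represented. Names below are illustrative.

```python
import numpy as np

FN_COST = 10.0   # cost of a false negative relative to a false positive

def sample_weights(y):
    """Per-sample weights used in both split selection (weighted Gini) and leaf estimates."""
    return np.where(y == 1, FN_COST, 1.0)

def stratified_bootstrap(y, tree_id, base_seed=1234):
    """Bootstrap positives and negatives separately so every tree sees the ~1% positives."""
    rng = np.random.default_rng(base_seed + tree_id)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    return np.concatenate([
        rng.choice(pos, size=len(pos), replace=True),
        rng.choice(neg, size=len(neg), replace=True),
    ])

# Illustrative check on a 1%-positive label vector of the stated size.
y = (np.random.default_rng(0).random(200_000) < 0.01).astype(int)
in_bag = stratified_bootstrap(y, tree_id=0)
print(len(in_bag), y[in_bag].mean(), sample_weights(y).sum())
```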
Parallelization and Systems Aspects
- Design thread-safe data structures.
- Quantify training and inference time complexity and memory usage.
- Estimate wall-clock time on 8 cores (a process-pool sketch and a memory back-of-envelope follow this list).
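Since trees are independent, one natural layout is a single immutable, shared feature matrix plus a process pool. A minimal sketch follows, assuming the data is stored as .npy files and memory-mapped into each worker; the tree growth itself is elided with a placeholder. The 80 MB figure is simple arithmetic for a float32 matrix of the stated size, not a measured number.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

# Back-of-envelope memory: a float32 feature matrix of 200_000 x 100 is
#   200_000 * 100 * 4 bytes ~= 80 MB,
# so one shared read-only copy plus per-tree working arrays fits well under the 2 GB budget.

def train_one_tree(args):
    """Worker: open shared data read-only, draw a bootstrap, grow one tree (growth elided)."""
    tree_id, x_path, y_path = args
    X = np.load(x_path, mmap_mode="r")   # memory-mapped: workers share pages, no full copies
    y = np.load(y_path, mmap_mode="r")
    rng = np.random.default_rng(1234 + tree_id)
    in_bag = rng.integers(0, len(y), size=len(y))
    # ... grow a CART on X[in_bag], y[in_bag] here and return a compact serialized tree ...
    return tree_id, float(X[in_bag, 0].mean())

def train_forest(n_trees, x_path, y_path, n_workers=8):
    """Trees are independent, so training parallelizes across the 8 cores near-linearly."""
    jobs = [(t, x_path, y_path) for t in range(n_trees)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(train_one_tree, jobs))

if __name__ == "__main__":
    # Tiny smoke test with stand-in data written to disk.
    np.save("X.npy", np.random.default_rng(0).random((1_000, 100)).astype(np.float32))
    np.save("y.npy", (np.random.default_rng(1).random(1_000) < 0.01).astype(np.int8))
    print(train_forest(n_trees=16, x_path="X.npy", y_path="y.npy"))
```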
Streaming / Warm-Start
- Support adding trees over time without retraining existing trees.
- Discuss concept-drift detection using OOB metrics (see the warm-start sketch after this list).
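A minimal warm-start wrapper, assuming an opaque fit_tree callable (hypothetical) and a running history of OOB ROC-AUC readings; the drift test is a simple recent-window-vs-baseline comparison, one of several reasonable choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

class GrowableForest:
    """Warm-start container: new trees are appended; existing trees are never retrained."""

    def __init__(self):
        self.trees = []              # opaque fitted-tree objects
        self.oob_auc_history = []    # one OOB ROC-AUC reading per growth step

    def add_trees(self, fit_tree, X, y, n_new, seed0):
        # fit_tree is a hypothetical callable: (X, y, seed) -> fitted tree
        self.trees += [fit_tree(X, y, seed0 + t) for t in range(n_new)]

    def record_oob_auc(self, y_true, oob_proba):
        self.oob_auc_history.append(roc_auc_score(y_true, oob_proba))

    def drift_suspected(self, window=5, tolerance=0.02):
        """Flag possible concept drift when recent OOB AUC falls below the older baseline."""
        h = self.oob_auc_history
        if len(h) <= window:
            return False
        return float(np.mean(h[-window:])) < float(np.mean(h[:-window])) - tolerance

# Illustrative drift check on a fabricated AUC history (for demonstration only).
f = GrowableForest()
f.oob_auc_history = [0.91, 0.92, 0.91, 0.92, 0.91, 0.88, 0.87, 0.86, 0.87, 0.86]
print(f.drift_suspected())   # True: the recent window is clearly below the baseline
```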
Deliverables
- Clear pseudocode for train(), predict_proba(), and OOB evaluation (a minimal end-to-end skeleton is sketched at the end of this prompt).
- Justify all design choices.
Assume m_try = ⌊√100⌋ = 10. If information is missing (e.g., the exact mix of numeric vs. categorical features), make minimal, explicit assumptions to complete the design.
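To close, a compact end-to-end skeleton of train(), predict_proba(), and OOB evaluation. It uses scikit-learn's DecisionTreeClassifier purely as a stand-in for the custom CART described above (it provides neither native categorical splits nor surrogate missing-value handling), so it only illustrates the forest-level control flow: per-tree seeding, class weighting, feature subsampling, and OOB aggregation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

def train(X, y, n_trees=100, max_depth=12, min_samples_leaf=50, base_seed=1234, m_try=10):
    """Grow a bagged forest; returns the trees plus per-tree OOB predictions."""
    forest, oob_preds = [], []
    for t in range(n_trees):
        rng = np.random.default_rng(base_seed + t)             # per-tree seed
        in_bag = rng.integers(0, len(y), size=len(y))
        oob = np.setdiff1d(np.arange(len(y)), in_bag)
        tree = DecisionTreeClassifier(criterion="gini", max_depth=max_depth,
                                      min_samples_leaf=min_samples_leaf,
                                      max_features=m_try,               # feature subsample per split
                                      class_weight={0: 1.0, 1: 10.0},   # 10x FN cost
                                      random_state=base_seed + t)
        tree.fit(X[in_bag], y[in_bag])
        forest.append(tree)
        if len(oob):
            oob_preds.append((oob, tree.predict_proba(X[oob])[:, 1]))
    return forest, oob_preds

def predict_proba(forest, X):
    """Positive-class probability = mean of the per-tree leaf probabilities."""
    return np.mean([t.predict_proba(X)[:, 1] for t in forest], axis=0)

def oob_evaluate(y, oob_preds):
    """Aggregate OOB predictions per row, then score with ROC-AUC and PR-AUC."""
    s, c = np.zeros(len(y)), np.zeros(len(y))
    for idx, p in oob_preds:
        s[idx] += p
        c[idx] += 1
    m = c > 0                                   # rows seen OOB by at least one tree
    proba = s[m] / c[m]
    return {"roc_auc": roc_auc_score(y[m], proba),
            "pr_auc": average_precision_score(y[m], proba)}

# Tiny smoke test on synthetic, imbalanced data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 100)).astype(np.float32)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=5_000) > 4.5).astype(int)
forest, oob_preds = train(X, y, n_trees=20)
print(oob_evaluate(y, oob_preds), predict_proba(forest, X[:3]))
```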