Implement a Memory-Efficient Random Forest (Binary Classification) Under Constraints
You are asked to design and implement a Random Forest for binary classification under the following constraints. Assume a dataset with N = 200,000 rows and d = 100 features (a mix of numerical and high-cardinality categorical), and a total memory budget of 2 GB. Your design should be robust enough for a production environment and suitable for an onsite interview discussion.
Requirements
Trees
- Use CART with Gini impurity (a weighted-Gini split sketch follows this list).
- Hyperparameters: max_depth, min_samples_leaf.
- Missing values: either surrogate splits or median/mode imputation.
- Categorical splits: support directly (no one-hot encoding).
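To make the split criterion concrete, here is a minimal weighted-Gini sketch in Python/NumPy. It assumes per-sample weights (so the 10× false-negative cost from the imbalance requirement can be folded in); the function names and the synthetic example are illustrative, not part of the spec.

```python
import numpy as np

def weighted_gini(y, w):
    """Gini impurity of one node under per-sample weights (binary labels 0/1)."""
    total = w.sum()
    if total == 0.0:
        return 0.0
    p_pos = w[y == 1].sum() / total
    return 1.0 - p_pos**2 - (1.0 - p_pos)**2

def split_impurity(y, w, go_left):
    """Weight-averaged Gini of the two children induced by the boolean mask go_left."""
    total = w.sum()
    left, right = go_left, ~go_left
    return (w[left].sum() / total) * weighted_gini(y[left], w[left]) \
         + (w[right].sum() / total) * weighted_gini(y[right], w[right])

# Illustrative numeric-threshold split on one feature; positives are roughly 1% of rows.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = (x + rng.normal(scale=0.5, size=10_000) > 2.6).astype(int)
w = np.where(y == 1, 10.0, 1.0)                  # 10x cost on missing a positive
print(split_impurity(y, w, go_left=(x <= 2.0)))
```

For binary classification, native categorical splits can stay tractable via the standard trick of sorting categories by their (weighted) positive rate and scanning only contiguous prefixes of that ordering, which avoids both the exponential subset search and one-hot encoding.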
Bagging and Feature Bagging
- Bootstrap sampling per tree (bagging).
- Feature subsampling per node with m_try = ⌊√d⌋.
- Deterministic reproducibility with per-tree seeds (a seeding sketch follows this list).
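One way to get deterministic bagging is to derive each tree's RNG from a single base seed; the sketch below assumes that scheme (names such as BASE_SEED and bootstrap_indices are illustrative).

```python
import numpy as np

M_TRY = 10          # floor(sqrt(100)) candidate features per node
BASE_SEED = 1234    # single global seed; each tree's RNG depends only on (BASE_SEED, tree_id)

def bootstrap_indices(n_rows, tree_id):
    """Per-tree bootstrap sample plus the implied out-of-bag rows (~36.8% of the data)."""
    rng = np.random.default_rng(BASE_SEED + tree_id)
    in_bag = rng.integers(0, n_rows, size=n_rows)    # sample rows with replacement
    oob = np.ones(n_rows, dtype=bool)
    oob[in_bag] = False                              # rows never drawn are out-of-bag
    return in_bag, np.flatnonzero(oob), rng

def candidate_features(rng, n_features, m_try=M_TRY):
    """Feature subsample drawn independently at every node from the tree's own RNG."""
    return rng.choice(n_features, size=m_try, replace=False)

in_bag, oob, rng = bootstrap_indices(n_rows=200_000, tree_id=7)
print(len(in_bag), len(oob), sorted(candidate_features(rng, n_features=100)))
```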
Out-of-Bag (OOB) Evaluation
- Compute OOB ROC-AUC, PR-AUC, and a reliability diagram.
- Calibrate class probabilities using OOB predictions: compare Platt scaling vs. isotonic regression and justify when each is preferable (a metric-and-calibration sketch follows this list).
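A sketch of the OOB scoring and calibration step, assuming the per-row OOB probabilities have already been aggregated across trees; the scikit-learn calls are standard, and the synthetic scores at the end only stand in for real forest output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.linear_model import LogisticRegression     # Platt scaling = 1-D logistic fit
from sklearn.isotonic import IsotonicRegression         # monotone, non-parametric calibration
from sklearn.calibration import calibration_curve

def oob_report(y, oob_proba):
    """Rank metrics plus both calibrators fitted on the OOB scores."""
    print("OOB ROC-AUC:", roc_auc_score(y, oob_proba))
    print("OOB PR-AUC :", average_precision_score(y, oob_proba))
    # Platt scaling: two parameters, robust when OOB positives are scarce.
    platt = LogisticRegression().fit(oob_proba.reshape(-1, 1), y)
    # Isotonic regression: more flexible, can overfit very small positive counts.
    iso = IsotonicRegression(out_of_bounds="clip").fit(oob_proba, y)
    # Reliability-diagram data: observed positive rate vs. mean predicted score per bin.
    frac_pos, mean_pred = calibration_curve(y, oob_proba, n_bins=10)
    return platt, iso, (frac_pos, mean_pred)

# Synthetic stand-in for aggregated OOB output (positive rate ~1%).
rng = np.random.default_rng(0)
y = (rng.random(20_000) < 0.01).astype(int)
scores = np.clip(0.05 + 0.4 * y + rng.normal(scale=0.1, size=len(y)), 0.0, 1.0)
oob_report(y, scores)
```

The usual rule of thumb: Platt scaling is the safer choice when OOB positives are few, while isotonic regression is preferable once there is enough data to support its non-parametric, monotone fit.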
Severe Class Imbalance
- Positive rate ≈ 1%.
- Incorporate a 10× misclassification cost for FN vs. FP into both split selection and sampling (e.g., class_weight or a stratified bootstrap; see the sketch after this list).
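One possible (not the only) way to satisfy both halves of this requirement: per-sample weights that feed the weighted Gini and the leaf probability estimates, plus a stratified bootstrap so each tree's in-bag sample keeps the rare class represented. Names below are illustrative.

```python
import numpy as np

FN_COST = 10.0   # cost of a false negative relative to a false positive

def sample_weights(y):
    """Per-sample weights used in both split selection (weighted Gini) and leaf estimates."""
    return np.where(y == 1, FN_COST, 1.0)

def stratified_bootstrap(y, tree_id, base_seed=1234):
    """Bootstrap positives and negatives separately so every tree sees the ~1% positives."""
    rng = np.random.default_rng(base_seed + tree_id)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    return np.concatenate([
        rng.choice(pos, size=len(pos), replace=True),
        rng.choice(neg, size=len(neg), replace=True),
    ])

# Illustrative check on a 1%-positive label vector of the stated size.
y = (np.random.default_rng(0).random(200_000) < 0.01).astype(int)
in_bag = stratified_bootstrap(y, tree_id=0)
print(len(in_bag), y[in_bag].mean(), sample_weights(y).sum())
```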
Parallelization and Systems Aspects
- Design thread-safe data structures.
- Quantify training and inference time complexity and memory usage.
- Estimate wall-clock time on 8 cores (a process-pool sketch and a memory back-of-envelope follow this list).
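Since trees are independent, one natural layout is a single immutable, shared feature matrix plus a process pool. A minimal sketch follows, assuming the data is stored as .npy files and memory-mapped into each worker; the tree growth itself is elided with a placeholder. The 80 MB figure is simple arithmetic for a float32 matrix of the stated size, not a measured number.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

# Back-of-envelope memory: a float32 feature matrix of 200_000 x 100 is
#   200_000 * 100 * 4 bytes ~= 80 MB,
# so one shared read-only copy plus per-tree working arrays fits well under the 2 GB budget.

def train_one_tree(args):
    """Worker: open shared data read-only, draw a bootstrap, grow one tree (growth elided)."""
    tree_id, x_path, y_path = args
    X = np.load(x_path, mmap_mode="r")   # memory-mapped: workers share pages, no full copies
    y = np.load(y_path, mmap_mode="r")
    rng = np.random.default_rng(1234 + tree_id)
    in_bag = rng.integers(0, len(y), size=len(y))
    # ... grow a CART on X[in_bag], y[in_bag] here and return a compact serialized tree ...
    return tree_id, float(X[in_bag, 0].mean())

def train_forest(n_trees, x_path, y_path, n_workers=8):
    """Trees are independent, so training parallelizes across the 8 cores near-linearly."""
    jobs = [(t, x_path, y_path) for t in range(n_trees)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(train_one_tree, jobs))

if __name__ == "__main__":
    # Tiny smoke test with stand-in data written to disk.
    np.save("X.npy", np.random.default_rng(0).random((1_000, 100)).astype(np.float32))
    np.save("y.npy", (np.random.default_rng(1).random(1_000) < 0.01).astype(np.int8))
    print(train_forest(n_trees=16, x_path="X.npy", y_path="y.npy"))
```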
Streaming / Warm-Start
- Support adding trees over time without retraining existing trees.
- Discuss concept-drift detection using OOB metrics (see the warm-start sketch after this list).
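A minimal warm-start wrapper, assuming an opaque fit_tree callable (hypothetical) and a running history of OOB ROC-AUC readings; the drift test is a simple recent-window-vs-baseline comparison, one of several reasonable choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

class GrowableForest:
    """Warm-start container: new trees are appended; existing trees are never retrained."""

    def __init__(self):
        self.trees = []              # opaque fitted-tree objects
        self.oob_auc_history = []    # one OOB ROC-AUC reading per growth step

    def add_trees(self, fit_tree, X, y, n_new, seed0):
        # fit_tree is a hypothetical callable: (X, y, seed) -> fitted tree
        self.trees += [fit_tree(X, y, seed0 + t) for t in range(n_new)]

    def record_oob_auc(self, y_true, oob_proba):
        self.oob_auc_history.append(roc_auc_score(y_true, oob_proba))

    def drift_suspected(self, window=5, tolerance=0.02):
        """Flag possible concept drift when recent OOB AUC falls below the older baseline."""
        h = self.oob_auc_history
        if len(h) <= window:
            return False
        return float(np.mean(h[-window:])) < float(np.mean(h[:-window])) - tolerance

# Illustrative drift check on a fabricated AUC history (for demonstration only).
f = GrowableForest()
f.oob_auc_history = [0.91, 0.92, 0.91, 0.92, 0.91, 0.88, 0.87, 0.86, 0.87, 0.86]
print(f.drift_suspected())   # True: the recent window is clearly below the baseline
```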
Deliverables
- Clear pseudocode for train(), predict_proba(), and OOB evaluation (a minimal end-to-end skeleton is sketched at the end of this prompt).
- Justify all design choices.
Assume m_try = ⌊√100⌋ = 10. If information is missing (e.g., the exact mix of numeric vs. categorical features), make minimal, explicit assumptions to complete the design.
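To close, a compact end-to-end skeleton of train(), predict_proba(), and OOB evaluation. It uses scikit-learn's DecisionTreeClassifier purely as a stand-in for the custom CART described above (it provides neither native categorical splits nor surrogate missing-value handling), so it only illustrates the forest-level control flow: per-tree seeding, class weighting, feature subsampling, and OOB aggregation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

def train(X, y, n_trees=100, max_depth=12, min_samples_leaf=50, base_seed=1234, m_try=10):
    """Grow a bagged forest; returns the trees plus per-tree OOB predictions."""
    forest, oob_preds = [], []
    for t in range(n_trees):
        rng = np.random.default_rng(base_seed + t)             # per-tree seed
        in_bag = rng.integers(0, len(y), size=len(y))
        oob = np.setdiff1d(np.arange(len(y)), in_bag)
        tree = DecisionTreeClassifier(criterion="gini", max_depth=max_depth,
                                      min_samples_leaf=min_samples_leaf,
                                      max_features=m_try,               # feature subsample per split
                                      class_weight={0: 1.0, 1: 10.0},   # 10x FN cost
                                      random_state=base_seed + t)
        tree.fit(X[in_bag], y[in_bag])
        forest.append(tree)
        if len(oob):
            oob_preds.append((oob, tree.predict_proba(X[oob])[:, 1]))
    return forest, oob_preds

def predict_proba(forest, X):
    """Positive-class probability = mean of the per-tree leaf probabilities."""
    return np.mean([t.predict_proba(X)[:, 1] for t in forest], axis=0)

def oob_evaluate(y, oob_preds):
    """Aggregate OOB predictions per row, then score with ROC-AUC and PR-AUC."""
    s, c = np.zeros(len(y)), np.zeros(len(y))
    for idx, p in oob_preds:
        s[idx] += p
        c[idx] += 1
    m = c > 0                                   # rows seen OOB by at least one tree
    proba = s[m] / c[m]
    return {"roc_auc": roc_auc_score(y[m], proba),
            "pr_auc": average_precision_score(y[m], proba)}

# Tiny smoke test on synthetic, imbalanced data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 100)).astype(np.float32)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=5_000) > 4.5).astype(int)
forest, oob_preds = train(X, y, n_trees=20)
print(oob_evaluate(y, oob_preds), predict_proba(forest, X[:3]))
```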