You are given partially implemented code and must complete key functions. Implement the missing parts with correct logic and reasonable efficiency.
Task A) Decision tree split function (classification)
You are building a binary decision tree classifier for tabular numeric features.
Implement:
- `gini(y)` that computes the Gini impurity of labels `y` (binary labels 0/1), i.e. $G(y) = 1 - p_0^2 - p_1^2$, where $p_k$ is the fraction of samples with label $k$.
- `best_split(X, y)` that finds the best `(feature_index, threshold)` to split on.
Definitions/requirements:
- `X` is an `n x d` matrix of floats.
- `y` has length `n`, with values in `{0, 1}`.
- A split is defined by `(j, t)`, meaning:
  - left child: samples with `X[i][j] <= t`
  - right child: samples with `X[i][j] > t`
- Choose the split that minimizes the weighted Gini impurity:
  $G_{\text{split}} = \frac{n_L}{n}\,G(y_L) + \frac{n_R}{n}\,G(y_R)$
- If a split would produce an empty side, it is invalid.
- Return `None` if no valid split exists.
Constraints:
- `1 <= n <= 2e4`, `1 <= d <= 50`
- Aim for ~O(n d log n) or better; a sketch of one way to achieve this follows.
Task B) Single-step gradient descent update (linear regression)
You are fitting linear regression with mean squared error:
$L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left(w^\top x_i + b - y_i\right)^2$
Implement:
- `gd_step(X, y, w, b, lr)` that performs *one* gradient descent update and returns `(new_w, new_b)`.
Requirements:
- `X` is `n x d`; `y` has length `n` (real-valued).
- `w` has length `d`; `b` is a scalar.
- Use the correct gradients for MSE (include the averaging over `n`).
- Must work for `n = 1` and `d = 1`.
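
A minimal sketch, again assuming NumPy. The gradients follow directly from the loss above: with residuals $r_i = w^\top x_i + b - y_i$, we get $\partial L / \partial w = \frac{2}{n} X^\top r$ and $\partial L / \partial b = \frac{2}{n} \sum_i r_i$:

```python
import numpy as np

def gd_step(X, y, w, b, lr):
    """One gradient descent update for MSE linear regression."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(y.size, -1)  # handles d = 1 inputs
    w = np.asarray(w, dtype=float)
    n = y.size
    r = X @ w + b - y                      # residuals, shape (n,)
    grad_w = (2.0 / n) * (X.T @ r)         # dL/dw = (2/n) * X^T r
    grad_b = (2.0 / n) * r.sum()           # dL/db = (2/n) * sum(r)
    return w - lr * grad_w, b - lr * grad_b
```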
Notes
- You may assume Python-like pseudocode and basic numeric operations (no need for a full training loop).
- Handle edge cases cleanly (constant labels, identical feature values, etc.); a few sanity checks are sketched below.
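
For concreteness, a few sanity checks against the sketches above, covering the edge cases just mentioned (the values here are illustrative, not part of the task):

```python
# Identical feature values in every column -> no valid split.
X = [[1.0], [1.0], [1.0]]
y = [0, 1, 1]
assert best_split(X, y) is None

# Constant labels -> impurity is already 0.
assert gini([1, 1, 1]) == 0.0

# One sample, one feature: gd_step must still work.
w, b = gd_step([[2.0]], [3.0], [0.0], 0.0, lr=0.1)
# residual r = -3; grad_w = (2/1) * 2 * (-3) = -12; grad_b = (2/1) * (-3) = -6
assert abs(w[0] - 1.2) < 1e-9 and abs(b - 0.6) < 1e-9
```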