NumPy-only implementation: R² and PCA (Data Scientist take-home)
Implement from scratch using only NumPy (no scikit-learn). Use float64 throughout and clearly show numeric results where requested.
(a) r2_score(y_true, y_pred)
Write a function r2_score(y_true, y_pred) that:
- Returns 1.0 if the predictions are exactly equal to y_true (elementwise equality).
- Returns -inf if var(y_true) = 0 (i.e., all y_true values are identical) and the predictions are not perfect.
- Otherwise computes R² as 1 − SS_res/SS_tot, where
  - SS_res = sum((y_true − y_pred)²),
  - SS_tot = sum((y_true − mean(y_true))²).
- Guards against division by zero using the rules above (a minimal sketch follows this list).
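One minimal NumPy sketch consistent with these rules (the structure and names are illustrative, not the required solution):

import numpy as np

def r2_score(y_true, y_pred):
    # Work in float64 throughout, as required.
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    # Exact elementwise equality -> perfect score.
    if np.array_equal(y_true, y_pred):
        return 1.0
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    # Constant y_true with imperfect predictions -> -inf (avoids 0/0).
    if ss_tot == 0.0:
        return float("-inf")
    ss_res = np.sum((y_true - y_pred) ** 2)
    return 1.0 - ss_res / ss_tot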
Test on:
- y_true = [3, -1, 2, 7, 5], y_pred = [2.5, -0.5, 2.1, 7.8, 5.2]
- Edge cases with y_true = [4, 4, 4, 4]:
  - y_pred = [4, 4, 4, 4]
  - y_pred = [4, 4, 5, 3]
Print the numeric outputs and briefly explain each.
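For example, the tests could be driven with a few print calls (the expected qualitative outcomes follow from the rules above; compute the exact numbers with your implementation):

print(r2_score([3, -1, 2, 7, 5], [2.5, -0.5, 2.1, 7.8, 5.2]))  # close predictions -> R² near 1
print(r2_score([4, 4, 4, 4], [4, 4, 4, 4]))                    # exact match on a constant target -> 1.0
print(r2_score([4, 4, 4, 4], [4, 4, 5, 3]))                    # constant target, imperfect predictions -> -inf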
(b) PCA on raw vs standardized features
Given the 6×3 matrix X:
X = [[10, 200, 0.50],
[12, 220, 0.40],
[ 9, 210, 0.55],
[11, 230, 0.60],
[ 8, 190, 0.45],
[13, 240, 0.65]]
Compute PCA twice:
- On raw X (center columns by their mean before computing the covariance).
- On standardized X (column-wise zero mean, unit variance; use the sample standard deviation with ddof=1), then run PCA on the standardized matrix.
For each case:
- Center appropriately, then compute the covariance matrix S = (X_centered^T X_centered)/(n−1).
- Obtain eigenvalues/eigenvectors (use np.linalg.eigh) and sort by eigenvalue in descending order.
- Report the explained_variance_ratio for the first two components.
- Print the first principal component vector (the eigenvector for the largest eigenvalue; note that its sign is arbitrary). A combined sketch for both cases follows this list.
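A compact sketch covering both runs (the helper name pca_summary is illustrative, not prescribed):

import numpy as np

X = np.array([[10, 200, 0.50],
              [12, 220, 0.40],
              [ 9, 210, 0.55],
              [11, 230, 0.60],
              [ 8, 190, 0.45],
              [13, 240, 0.65]], dtype=np.float64)

def pca_summary(M):
    # M is assumed already centered (and possibly standardized); sample covariance with n-1.
    n = M.shape[0]
    S = (M.T @ M) / (n - 1)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]          # reorder descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()            # explained_variance_ratio
    return ratio, eigvecs

# Raw case: center columns only.
Xc = X - X.mean(axis=0)
ratio_raw, vecs_raw = pca_summary(Xc)
print("raw PCA:          ratio[:2] =", ratio_raw[:2], " PC1 =", vecs_raw[:, 0])

# Standardized case: zero mean, unit variance with sample std (ddof=1).
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
ratio_std, vecs_std = pca_summary(Xs)
print("standardized PCA: ratio[:2] =", ratio_std[:2], " PC1 =", vecs_std[:, 0])

Because np.linalg.eigh may return either sign for each eigenvector, the printed PC1 can differ by an overall sign from other implementations without being wrong.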
Discuss:
- How scaling (standardizing) changes the principal components and their explained variance.
- Why eigenvector signs may flip without changing the subspace.