Binary Classification with Logistic Regression and Regularization
Data
- Two CSVs: a training set x and a test set x_test.
- Each has 7 columns:
  - Column 1: binary outcome y ∈ {0, 1}.
  - Columns 2–7: six continuous features f1–f6.
Task
Using Python (scikit-learn) or R (glmnet), do the following:

Load data
- X_train = columns 2–7 of x; y_train = column 1 of x.
- X_test = columns 2–7 of x_test; y_test = column 1 of x_test.
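In Python, the load step might look like the sketch below. The filenames x.csv and x_test.csv and the header-less layout are assumptions; the synthetic writer exists only to make the snippet self-contained, and would be dropped when the real files are present.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def make_csv(path, n):
    # Stand-in data matching the described layout:
    # column 1 is y, columns 2-7 are f1-f6.
    y = rng.integers(0, 2, size=(n, 1))
    X = rng.normal(size=(n, 6))
    pd.DataFrame(np.hstack([y, X])).to_csv(path, header=False, index=False)

make_csv("x.csv", 200)
make_csv("x_test.csv", 100)

# Assumed: no header row, so header=None and positional column access.
train = pd.read_csv("x.csv", header=None)
test = pd.read_csv("x_test.csv", header=None)

y_train, X_train = train.iloc[:, 0].astype(int), train.iloc[:, 1:7]
y_test, X_test = test.iloc[:, 0].astype(int), test.iloc[:, 1:7]

print(X_train.shape, y_train.shape)  # → (200, 6) (200,)
```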
Basic diagnostics
- Check missingness by column.
- Provide summary statistics for the features.
- Standardize the features, fitting the scaler on the training set only.
- Compute pairwise feature correlations; report any pair with |r| ≥ 0.8.
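The diagnostics above could be sketched as follows; synthetic data stands in for the real X_train, and the feature names f1–f6 follow the data description.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(200, 6)),
                       columns=[f"f{i}" for i in range(1, 7)])

# Missingness by column.
print(X_train.isna().sum())

# Summary statistics for the features.
print(X_train.describe())

# Standardize: fit on the training set only, reuse the same
# fitted scaler later for the test set.
scaler = StandardScaler()
X_train_std = pd.DataFrame(scaler.fit_transform(X_train),
                           columns=X_train.columns)

# Pairwise correlations; flag off-diagonal pairs with |r| >= 0.8.
corr = X_train.corr()
high = [(a, b, corr.loc[a, b])
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if abs(corr.loc[a, b]) >= 0.8]
print("highly correlated pairs:", high)
```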
Modeling
- Fit a baseline logistic regression (no regularization).
- Fit regularized models with L1 (Lasso) and L2 (Ridge) penalties.
- Use cross-validation to select the penalty strength (C in scikit-learn or lambda in glmnet).
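With scikit-learn, the three fits could be sketched as below. Synthetic data stands in for the real standardized training set, and a very large C approximates "no regularization" portably across scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))          # stand-in standardized features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Baseline: effectively unregularized (huge C = tiny penalty).
base = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

# L1 and L2 with 5-fold CV over a grid of C values; in scikit-learn,
# C is the inverse of the regularization strength.
Cs = np.logspace(-3, 3, 13)
l1 = LogisticRegressionCV(Cs=Cs, cv=5, penalty="l1", solver="liblinear",
                          scoring="roc_auc").fit(X, y)
l2 = LogisticRegressionCV(Cs=Cs, cv=5, penalty="l2",
                          scoring="roc_auc", max_iter=1000).fit(X, y)

print("chosen C (L1):", l1.C_[0], " chosen C (L2):", l2.C_[0])
```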
Evaluation on the held-out test set
- Report ROC AUC (required).
- Report at least one threshold-dependent metric (e.g., F1) at the threshold that maximizes Youden's J, with that threshold selected on the validation folds.
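A sketch of the metric computations, again on synthetic stand-in data. For brevity it picks the Youden's-J threshold on a single validation split; the actual pipeline would pick it on the cross-validation folds as specified above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))          # stand-in features
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.7, size=400) > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_val = model.predict_proba(X_val)[:, 1]

# Youden's J = TPR - FPR; take the ROC point that maximizes it.
fpr, tpr, thresholds = roc_curve(y_val, p_val)
t_star = thresholds[np.argmax(tpr - fpr)]

auc = roc_auc_score(y_val, p_val)
f1 = f1_score(y_val, (p_val >= t_star).astype(int))
print(f"AUC={auc:.3f}  threshold={t_star:.3f}  F1={f1:.3f}")
```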
Model comparison and interpretation
- Compare the baseline, L1, and L2 models and justify the chosen regularization.
- Identify the most important features:
  - L1: features with non-zero coefficients.
  - L2: features with the largest absolute standardized coefficients.
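Both importance readings could be extracted as sketched here, on synthetic already-standardized stand-in data (the penalty strengths are illustrative, not the CV-selected values):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))          # stand-in, already standardized
y = (2 * X[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=300) > 0).astype(int)
names = [f"f{i}" for i in range(1, 7)]

# L1 at a fairly strong penalty: surviving (non-zero) coefficients
# mark the selected features.
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
selected = [n for n, c in zip(names, l1.coef_[0]) if c != 0]

# L2: rank by |coefficient|, which is comparable across features
# only because the inputs are standardized.
l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
ranked = sorted(zip(names, np.abs(l2.coef_[0])), key=lambda t: -t[1])

print("L1 selected:", selected)
print("L2 ranking:", [n for n, _ in ranked])
```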
Outputs
- Final predicted probabilities for X_test.
- Confusion matrix on X_test at the selected threshold.
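Producing both deliverables could look like this sketch; synthetic data stands in for the real train/test split, and the 0.5 threshold is a placeholder for the value selected on the validation folds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 6))    # stand-in training data
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_test = rng.normal(size=(100, 6))     # stand-in test data
y_test = (X_test[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]   # final predicted probabilities
t_star = 0.5                                # placeholder for selected threshold

cm = confusion_matrix(y_test, (proba >= t_star).astype(int))
print("first 5 probabilities:", np.round(proba[:5], 3))
print("confusion matrix:\n", cm)
```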