Train and evaluate logistic model with regularization
Company: Atlassian
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
You have two CSVs: training set x and test set x_test. Each has 7 columns: the first is the binary outcome y ∈ {0,1}; the next 6 are continuous features f1–f6. Using either Python (scikit-learn) or R (glmnet), write code to:
(1) Load the data: X_train = columns 2–7 of x, y_train = column 1 of x; X_test = columns 2–7 of x_test, y_test = column 1 of x_test.
(2) Perform basic diagnostics: missingness, summary stats, feature scaling, and a correlation/collinearity check (report any |r| ≥ 0.8).
(3) Fit a baseline logistic regression, then regularized models with L1 (Lasso) and L2 (Ridge). Use cross-validation to select the penalty strength (C in sklearn or lambda in glmnet).
(4) Report metrics on the held-out test set: ROC AUC (required) and at least one threshold-dependent metric (e.g., F1 at the threshold maximizing Youden's J on the validation fold).
(5) Compare models, justify the chosen regularization, and identify the most important features (non-zero coefficients for L1, or largest absolute standardized coefficients for L2).
(6) Provide the final predicted probabilities for X_test and a confusion matrix at your selected threshold.
Quick Answer: This question evaluates proficiency in supervised binary classification with logistic regression, implementation of L1 and L2 regularization, data preprocessing (missingness checks and standardization), hyperparameter selection via cross-validation, and model evaluation using ROC AUC and threshold-dependent metrics.
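For the threshold-dependent part, one common recipe is to choose the cutoff that maximizes Youden's J (TPR − FPR) on held-out predictions, then report F1 and a confusion matrix at that cutoff. A sketch, with illustrative function names:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_curve

def youden_threshold(y_true, proba):
    # Youden's J = TPR - FPR; pick the probability cutoff maximizing it.
    fpr, tpr, thresholds = roc_curve(y_true, proba)
    return thresholds[np.argmax(tpr - fpr)]

def threshold_report(y_true, proba, threshold):
    # F1 and confusion matrix at the chosen cutoff.
    pred = (proba >= threshold).astype(int)
    return {"f1": f1_score(y_true, pred),
            "confusion_matrix": confusion_matrix(y_true, pred)}
```

To match the question's setup, the threshold should be selected on validation-fold predictions and only then applied to the test-set probabilities, keeping the test set untouched during selection.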