You are given a tabular dataset as a pandas DataFrame df with:
- Feature columns (numeric and/or categorical)
- A target column y (either binary classification or continuous regression)
You may use pandas and numpy (and standard Python), and you may use Google for documentation, but you may not use AI assistants or high-level ML libraries (e.g., scikit-learn).
Tasks:
- Data preparation
  - Handle missing values.
  - Encode categorical variables.
  - Split into train/validation (or implement cross-validation).
  - Standardize/normalize features when appropriate.
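One possible way to organize the data-preparation steps is sketched here, using only pandas and numpy. It is a minimal illustration, not a reference solution. The function names (train_val_split, fit_preprocessor, transform), the median imputation, the one-hot encoding, and the 80/20 split are all assumptions made for the example; the target column name y comes from the problem statement. The split happens first, and every statistic is learned from the training rows only.

```python
import numpy as np
import pandas as pd

def train_val_split(df, val_frac=0.2, seed=0):
    """Shuffle row positions, then split into (train, validation) frames."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    n_val = int(round(len(df) * val_frac))
    return df.iloc[idx[n_val:]], df.iloc[idx[:n_val]]

def fit_preprocessor(train, target="y"):
    """Learn imputation values, category levels, and scaling stats from train only."""
    feats = train.drop(columns=[target])
    num_cols = feats.select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in feats.columns if c not in num_cols]
    medians = feats[num_cols].median()
    filled = feats[num_cols].fillna(medians)
    std = filled.std(ddof=0).replace(0, 1.0)  # constant columns: avoid division by zero
    levels = {c: sorted(feats[c].dropna().astype(str).unique()) for c in cat_cols}
    return {"num_cols": num_cols, "cat_cols": cat_cols, "medians": medians,
            "mean": filled.mean(), "std": std, "levels": levels}

def transform(df, prep):
    """Apply the learned preprocessing and return a float feature matrix."""
    # Numeric columns: impute with the train median, then standardize with train stats.
    num = (df[prep["num_cols"]].fillna(prep["medians"]) - prep["mean"]) / prep["std"]
    blocks = [num.to_numpy(dtype=float)]
    # Categorical columns: one-hot against the train levels. Missing values and
    # categories unseen during training become an all-zero row for that column.
    for col in prep["cat_cols"]:
        values = df[col].astype(str).to_numpy(dtype=object)
        levels = np.array(prep["levels"][col], dtype=object)
        blocks.append((values[:, None] == levels[None, :]).astype(float))
    return np.hstack(blocks)
```

A typical call sequence would be train, val = train_val_split(df), then prep = fit_preprocessor(train), then transform(train, prep) and transform(val, prep). Keeping the fit and transform steps separate is a deliberate choice: it makes it hard to compute a mean or a category list from validation rows by accident.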
- Modeling (from scratch)
  - Choose a reasonable baseline model (e.g., linear regression for regression; logistic regression for binary classification).
  - Implement training using numpy (e.g., gradient descent).
  - Implement prediction.
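The modeling steps could be sketched as follows. This is one hedged possibility, assuming full-batch gradient descent and a feature matrix X that is already numeric and scaled. The names fit_gd and predict, the task flag, and the default learning rate, iteration count, and l2 strength are illustrative choices, not part of the problem statement. Linear and logistic regression share the same gradient form, X^T (prediction - y) / n, so a single loop covers both.

```python
import numpy as np

def sigmoid(z):
    # Clip logits so np.exp does not overflow on extreme values.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def predict(X, w, b, task="classification"):
    """Return probabilities for classification, raw values for regression."""
    z = X @ w + b
    return sigmoid(z) if task == "classification" else z

def fit_gd(X, y, task="classification", lr=0.1, n_iter=2000, l2=0.0):
    """Full-batch gradient descent on mean log-loss (classification)
    or half mean squared error (regression), with an optional L2 penalty on w."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        err = predict(X, w, b, task) - y      # shape (n,)
        w -= lr * (X.T @ err / n + l2 * w)    # the bias is left unpenalized
        b -= lr * err.mean()
    return w, b
```

For classification, hard labels can be obtained by thresholding, for example predict(X, w, b) >= 0.5. The threshold itself is a tunable choice, not a fixed part of the model.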
- Evaluation
  - Pick suitable metrics (e.g., MSE/RMSE for regression; accuracy/precision/recall/F1/AUC for classification).
  - Explain how you would detect overfitting and what you would do about it.
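The metrics named above can be computed with numpy alone. The sketch below is a minimal version under a few assumptions: y holds 0/1 labels (or real values for rmse) as a numpy array, prob holds predicted probabilities, and the function names are made up for this example. The AUC uses the pairwise definition (the fraction of positive/negative pairs ranked correctly, with ties counted as half), which is simple but needs memory proportional to positives times negatives.

```python
import numpy as np

def rmse(y, pred):
    return float(np.sqrt(np.mean((y - pred) ** 2)))

def classification_metrics(y, prob, threshold=0.5):
    """Accuracy, precision, recall, and F1 at a given probability threshold."""
    pred = (prob >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y == 1)))
    fp = int(np.sum((pred == 1) & (y == 0)))
    fn = int(np.sum((pred == 0) & (y == 1)))
    tn = int(np.sum((pred == 0) & (y == 0)))
    # Guard against empty denominators (e.g. no positive predictions).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y), "precision": precision,
            "recall": recall, "f1": f1}

def auc(y, prob):
    """Probability that a random positive scores above a random negative."""
    pos, neg = prob[y == 1], prob[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```

To look for overfitting, compute the same metric on the training and validation sets. A training score that is much better than the validation score, or a validation loss that starts rising while training loss keeps falling, points to overfitting.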
- Concept questions (be prepared to explain)
  - Bias–variance tradeoff
  - Regularization (L1 vs L2) and how it changes the objective
  - Class imbalance handling
  - Feature scaling: when it matters and why
  - Train/validation/test leakage and how to avoid it
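For the regularization question, it can help to write the objectives down. The notation here is an assumption for illustration: $L(w)$ is the unregularized training loss (for example mean squared error or mean log-loss), $w$ is the weight vector excluding the bias, and $\lambda \ge 0$ is the regularization strength.

```latex
\begin{aligned}
J_{\mathrm{L2}}(w) &= L(w) + \frac{\lambda}{2}\,\lVert w \rVert_2^2,
  & \nabla J_{\mathrm{L2}}(w) &= \nabla L(w) + \lambda\, w, \\
J_{\mathrm{L1}}(w) &= L(w) + \lambda\,\lVert w \rVert_1,
  & \partial J_{\mathrm{L1}}(w) &\ni \nabla L(w) + \lambda\,\operatorname{sign}(w).
\end{aligned}
```

The L2 term shrinks every weight in proportion to its size, so weights get small but rarely reach exactly zero. The L1 term pushes with a constant magnitude regardless of size, which can drive some weights to exactly zero and so yields sparse models.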