Task: Batch Gradient Descent for Linear Regression (with Intercept)
You are interviewing for a Data Scientist role and are asked to implement batch gradient descent to fit a linear regression model with an intercept using mean squared error (MSE).
Given
- Design matrix X ∈ R^{m×n} that already includes an intercept column of ones (i.e., X[:, 0] = 1).
- Target vector y ∈ R^{m}.
- Learning rate α > 0.
- max_iters (maximum number of iterations).
- tolerance (for convergence checks).
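For concreteness, a minimal sketch of how these inputs might be assembled (the variable names, sizes, and random data below are illustrative only, not part of the task):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_features = 100, 3
features = rng.normal(size=(m, n_features))      # raw feature columns
X = np.column_stack([np.ones(m), features])      # prepend the intercept column of ones
y = rng.normal(size=m)                           # target vector
alpha, max_iters, tolerance = 0.1, 1000, 1e-6    # example hyperparameters
```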
Requirements
- Derive and state the gradient for the objective:
  - Objective: J(θ) = (1/(2m)) ||Xθ − y||².
  - Show that ∇J(θ) = (1/m) Xᵀ (Xθ − y), and implement the update θ ← θ − α ∇J(θ). (A short derivation sketch follows this item.)
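A minimal derivation sketch, expanding the squared norm and differentiating with respect to θ using standard matrix-calculus identities (nothing here goes beyond the objective stated above):

```latex
\begin{aligned}
J(\theta) &= \tfrac{1}{2m}(X\theta - y)^\top (X\theta - y)
           = \tfrac{1}{2m}\left(\theta^\top X^\top X\theta - 2y^\top X\theta + y^\top y\right),\\[4pt]
\nabla J(\theta) &= \tfrac{1}{2m}\left(2X^\top X\theta - 2X^\top y\right)
                  = \tfrac{1}{m}\, X^\top (X\theta - y),\\[4pt]
\theta &\leftarrow \theta - \alpha\,\nabla J(\theta)
        = \theta - \tfrac{\alpha}{m}\, X^\top (X\theta - y).
\end{aligned}
```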
- Provide vectorized, Python-like pseudocode for batch gradient descent that:
  - Uses only matrix/vector operations (no loops over samples).
  - Exposes inputs X, y, α, max_iters, tolerance.
  - Outputs θ ∈ R^{n} and cost_history (list of J values per iteration). (A reference sketch follows this item.)
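A minimal NumPy sketch meeting these requirements; the function name `batch_gradient_descent` and the zero initialization of θ are assumptions, not specified by the task:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha, max_iters, tolerance):
    """Batch gradient descent for MSE linear regression.

    X is assumed to already contain the intercept column of ones.
    Returns the fitted theta and the list of J values per iteration.
    """
    m, n = X.shape
    theta = np.zeros(n)                              # assumed starting point
    cost_history = []

    for it in range(max_iters):
        residual = X @ theta - y                     # X theta - y, shape (m,)
        cost = (residual @ residual) / (2 * m)       # J(theta) at the current theta
        cost_history.append(cost)

        grad = (X.T @ residual) / m                  # (1/m) X^T (X theta - y)

        # Stopping criterion 1: gradient norm below tolerance.
        if np.linalg.norm(grad) < tolerance:
            break

        # Stopping criterion 2: relative cost decrease below tolerance.
        if it > 0 and abs(cost_history[-2] - cost) <= tolerance * max(abs(cost_history[-2]), 1e-12):
            break

        theta = theta - alpha * grad                 # vectorized update step

    return theta, cost_history
```

Both checks run alongside the max_iters cap, so at least two stopping criteria are always active.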
- Stopping criteria: implement at least two, e.g., max_iters, and either:
  - relative cost decrease below tolerance, or
  - ||∇J(θ)||₂ below tolerance.
  Describe when each is preferable (a brief comparison sketch follows this item).
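One way to frame that comparison in code (helper names are illustrative): the gradient-norm test depends on the scale of X and y, while the relative-cost test is unit-free but can trigger on slow plateaus.

```python
import numpy as np

def converged_by_gradient(grad, tol):
    # Scale-sensitive: ||grad||_2 depends on feature units and on m, so tol
    # must be tuned per problem; preferable when you care about stationarity.
    return np.linalg.norm(grad) < tol

def converged_by_relative_cost(prev_cost, cost, tol):
    # Unit-free: one tol transfers across problems; preferable under a compute
    # budget, but it can fire on slow plateaus before the optimum is reached.
    return abs(prev_cost - cost) <= tol * max(abs(prev_cost), 1e-12)
```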
- Learning rate: discuss
  - Constant vs. time-decayed schedules.
  - How to pick α to avoid divergence.
  - How feature scaling/standardization affects convergence and the condition number of XᵀX.
  (An illustrative sketch follows this item.)
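An illustrative sketch of a time-decayed schedule and of column standardization; the decay form, the constants, and the helper names are example choices, and the stability comment assumes the quadratic MSE objective above:

```python
import numpy as np

def decayed_alpha(alpha0, t, decay=0.01):
    # Simple 1/(1 + decay * t) schedule; alpha0 and decay are example values.
    return alpha0 / (1.0 + decay * t)

def standardize_features(X):
    # Standardize every column except the intercept (column 0).
    X_std = X.astype(float).copy()
    mu = X_std[:, 1:].mean(axis=0)
    sigma = X_std[:, 1:].std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)     # guard against constant columns
    X_std[:, 1:] = (X_std[:, 1:] - mu) / sigma
    return X_std, mu, sigma

# For this objective the Hessian is (1/m) X^T X, so a constant step is stable
# when alpha < 2m / lambda_max(X^T X). Standardization shrinks the condition
# number kappa = np.linalg.cond(X.T @ X), which permits a larger stable alpha
# and typically fewer iterations to converge.
```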
- Regularization (L2): modify
  - J(θ) = (1/(2m))||Xθ − y||² + (λ/(2m))||θ_{1:}||² (exclude the intercept from the penalty).
  - Provide the new gradient and update rule, and discuss λ selection. (A sketch follows this item.)
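A possible sketch of the regularized gradient and update, consistent with the objective above (the helper name `ridge_gradient` is illustrative; θ₀ is the intercept and is left unpenalized):

```python
import numpy as np

def ridge_gradient(X, y, theta, lam):
    # Gradient of J(theta) = (1/(2m))||X theta - y||^2 + (lam/(2m))||theta[1:]||^2.
    m = X.shape[0]
    grad = (X.T @ (X @ theta - y)) / m    # unregularized MSE gradient
    penalty = (lam / m) * theta           # L2 penalty gradient
    penalty[0] = 0.0                      # exclude the intercept from shrinkage
    return grad + penalty

# Update rule: for j >= 1,
#   theta_j <- theta_j * (1 - alpha * lam / m) - (alpha / m) * (X.T @ (X @ theta - y))[j];
# the intercept theta_0 keeps the unregularized update. lam is typically chosen
# by cross-validation over a logarithmic grid.
```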