Write and explain gradient descent pseudocode
Company: Amazon
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
Write vectorized pseudocode (Python-like) for batch gradient descent to fit a linear regression model with intercept using mean squared error. Clearly explain each step and variable. Your pseudocode should expose inputs X ∈ R^{m×n} (assume an added intercept column), y ∈ R^{m}, learning_rate α, max_iters, tolerance, and should output θ ∈ R^{n} and cost_history.
Derive the update rule from J(θ) = (1/(2m)) ||Xθ − y||^2, show that ∇J(θ) = (1/m) X^T (Xθ − y), and implement θ ← θ − α ∇J(θ).
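The derivation above maps directly to a short vectorized loop. Below is a minimal NumPy sketch; the zero initialization of θ and the relative-cost stopping rule are illustrative choices, not the only valid ones:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, max_iters=1000, tol=1e-10):
    """Batch gradient descent for linear regression with MSE cost.

    X     : (m, n) design matrix, intercept column already included.
    y     : (m,) target vector.
    alpha : learning rate.
    Returns the fitted theta (n,) and the per-iteration cost_history.
    """
    m, n = X.shape
    theta = np.zeros(n)                        # start at the origin
    cost_history = []
    for _ in range(max_iters):
        residual = X @ theta - y               # Xθ − y, shape (m,)
        cost = (residual @ residual) / (2 * m) # J(θ) = (1/2m)‖Xθ−y‖²
        grad = X.T @ residual / m              # ∇J(θ) = (1/m) Xᵀ(Xθ−y)
        theta -= alpha * grad                  # θ ← θ − α ∇J(θ)
        cost_history.append(cost)
        # second stopping criterion: relative cost decrease below tol
        if len(cost_history) > 1:
            prev, curr = cost_history[-2], cost_history[-1]
            if abs(prev - curr) <= tol * max(prev, 1e-12):
                break
    return theta, cost_history
```

Each iteration costs O(mn) for the two matrix-vector products; no explicit Python loop over samples is needed.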
Then:
- Stopping criteria: implement at least two — (i) a maximum iteration count and (ii) relative cost decrease or gradient norm ||∇J||_2 falling below tolerance — and describe when each is preferable.
- Learning rate: discuss constant vs time-decayed schedules; how to pick α to avoid divergence; how feature scaling/standardization affects convergence rate and the condition number of X^T X.
- Regularization: modify J(θ) to include L2 with λ (J(θ) = (1/(2m))||Xθ − y||^2 + (λ/(2m))||θ_{1:}||^2 where the intercept is excluded). Write the new gradient and discuss λ selection.
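To make the feature-scaling point concrete, the snippet below compares the condition number of X^T X before and after z-score standardization. The data (one feature spanning thousands, one spanning ~1) is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
# Two features on very different scales: units of thousands vs. units of ~1.
raw = np.column_stack([rng.uniform(0, 5000, m), rng.uniform(0, 1, m)])
X_raw = np.column_stack([np.ones(m), raw])           # add intercept column

# Z-score each feature; the intercept column is left untouched.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
X_scaled = np.column_stack([np.ones(m), scaled])

kappa_raw = np.linalg.cond(X_raw.T @ X_raw)
kappa_scaled = np.linalg.cond(X_scaled.T @ X_scaled)
# Standardization shrinks κ(XᵀX) by orders of magnitude, which permits a
# larger stable α and far fewer iterations to converge.
```

Centering also makes the feature columns orthogonal to the intercept column, which further improves conditioning.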
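For the L2 variant, differentiating the penalized cost gives ∇J(θ) = (1/m) Xᵀ(Xθ − y) + (λ/m) θ with the intercept component zeroed out. A sketch of that gradient (function name and interface are illustrative):

```python
import numpy as np

def ridge_gradient(X, y, theta, lam):
    """Gradient of J(θ) = (1/2m)‖Xθ−y‖² + (λ/2m)‖θ_{1:}‖²,
    where θ_0 (the intercept) is excluded from the penalty."""
    m = X.shape[0]
    grad = X.T @ (X @ theta - y) / m   # unregularized part: (1/m) Xᵀ(Xθ−y)
    reg = (lam / m) * theta            # L2 part: (λ/m) θ_j for j ≥ 1
    reg[0] = 0.0                       # do not penalize the intercept
    return grad + reg
```

λ is typically chosen by cross-validation over a logarithmic grid; the penalized update can also be read as shrinking the non-intercept weights by a factor (1 − αλ/m) before the usual data-driven step.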
Quick Answer: Derive ∇J(θ) = (1/m) Xᵀ(Xθ − y) from J(θ) = (1/(2m))||Xθ − y||², then iterate θ ← θ − α ∇J(θ), recording the cost each step and stopping at max_iters or when the relative cost decrease (or ||∇J||_2) drops below tolerance. Standardizing features reduces the condition number of XᵀX, allowing a larger stable α and faster convergence; for L2 regularization, add (λ/m)θ_j to each non-intercept gradient component. The question evaluates vectorized implementation skill together with understanding of learning rate schedules, stopping criteria, feature scaling, and regularization.