Prove and apply statistical ML fundamentals
Company: Amazon
Role: Data Scientist
Category: Statistics & Math
Difficulty: hard
Interview Round: Technical Screen
Work through these statistical ML exercises with precise math and small computations; illustrative worked sketches for the computational parts follow the Quick Answer.

1) From first principles, derive ordinary least squares for linear regression: the model, its assumptions, the normal equations, the closed‑form estimator, the conditions under which (XᵀX)^{-1} exists, and the ridge solution; explain the bias–variance effects of regularization.

2) Logistic regression: write the negative log‑likelihood for binary labels, derive the gradient and Hessian, and prove convexity. Then compute one explicit gradient step (no bias term) with learning rate 0.5 for x=(1,2), y=1, and current weights w=(0.1,−0.2).

3) Overfitting: list three distinct mitigation techniques (e.g., regularization, early stopping, data augmentation) and explain when each helps or hurts; propose a cross‑validation plan to tune λ for L2 regularization.

4) Bootstrapping vs. boosting:
a) Bootstrapping: given the sample values [2,3,5,7,11], describe the percentile‑interval procedure for the mean; show the first two bootstrap resamples you would draw (with replacement) and compute their means; explain why the bootstrap can estimate uncertainty without parametric assumptions.
b) Boosting: explain the core idea (sequentially fitting to residuals or reweighted errors). Perform one AdaBoost step with three training points having initial weights of 1/3 each, where the weak learner misclassifies only the second point: compute ε, α = ½ ln((1−ε)/ε), the unnormalized updated weights, and the normalized distribution for the next round.

5) Compare bagging, boosting, and random forests in terms of bias, variance, and robustness to noisy labels; give one scenario where each is preferable.
Quick Answer: This question evaluates mastery of statistical machine-learning fundamentals—linear and logistic regression derivations, regularization and bias–variance trade-offs, resampling (bootstrap), boosting algorithms, and ensemble comparisons—using precise mathematical reasoning and small numeric computations.
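For exercise 1, a minimal derivation sketch (one standard presentation; the spherical-error assumption below is the usual one, and Gaussian errors are only needed for inference, not for the estimator itself):

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Model: $y = X\beta + \varepsilon$ with $X \in \mathbb{R}^{n \times p}$,
$\mathbb{E}[\varepsilon] = 0$, $\operatorname{Var}(\varepsilon) = \sigma^2 I$.
Minimizing $L(\beta) = \lVert y - X\beta \rVert_2^2$ and setting
\[
\nabla_\beta L = -2 X^\top (y - X\beta) = 0
\]
yields the normal equations $X^\top X \hat\beta = X^\top y$ and the closed form
$\hat\beta = (X^\top X)^{-1} X^\top y$, valid iff $X$ has full column rank
(no perfect collinearity, $n \ge p$). Ridge minimizes
$\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2$, giving
\[
\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y ,
\]
which exists for every $\lambda > 0$ because $X^\top X + \lambda I$ is
positive definite; the shrinkage adds bias but reduces variance.
\end{document}
```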
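A quick numeric sanity check of both closed forms, assuming NumPy is available; the toy data below is fabricated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
X = rng.normal(size=(50, 3))                   # toy design matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)  # small Gaussian noise

# OLS via the normal equations: solve (X^T X) beta = X^T y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: (X^T X + lambda I) is positive definite, hence always invertible
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(beta_ols)    # close to beta_true
print(beta_ridge)  # shrunk toward zero relative to the OLS solution
```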
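For exercise 2, one standard way to write the negative log-likelihood, gradient, Hessian, and convexity argument:

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
With $\sigma(z) = 1/(1 + e^{-z})$ and $y_i \in \{0,1\}$, the negative
log-likelihood is
\[
\ell(w) = -\sum_{i=1}^{n} \bigl[ y_i \log \sigma(w^\top x_i)
        + (1 - y_i) \log\bigl(1 - \sigma(w^\top x_i)\bigr) \bigr].
\]
Using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$,
\[
\nabla \ell(w) = \sum_{i=1}^{n} \bigl( \sigma(w^\top x_i) - y_i \bigr) x_i,
\qquad
\nabla^2 \ell(w) = \sum_{i=1}^{n} \sigma_i (1 - \sigma_i)\, x_i x_i^\top
                 = X^\top S X,
\]
with $S = \operatorname{diag}\bigl(\sigma_i(1-\sigma_i)\bigr) \succeq 0$.
For any $v$, $v^\top X^\top S X v = \lVert S^{1/2} X v \rVert_2^2 \ge 0$,
so the Hessian is positive semidefinite and $\ell$ is convex.
\end{document}
```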
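The explicit step then follows from the per-example gradient (σ(w·x) − y)x; a sketch of the arithmetic:

```python
import math

# One gradient step for exercise 2 (no bias term).
x = (1.0, 2.0)
y = 1.0
w = [0.1, -0.2]
lr = 0.5

z = sum(wi * xi for wi, xi in zip(w, x))   # w.x = 0.1 - 0.4 = -0.3
p = 1.0 / (1.0 + math.exp(-z))             # sigma(-0.3) ~ 0.4256
grad = [(p - y) * xi for xi in x]          # ~ (-0.5744, -1.1489)
w_new = [wi - lr * g for wi, g in zip(w, grad)]
print(w_new)                               # ~ [0.3872, 0.3744]
```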
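For exercise 3's tuning plan, a minimal 5-fold grid-search sketch, assuming scikit-learn (whose Ridge estimator calls the L2 strength alpha rather than λ); the data is again synthetic placeholder material:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                  # placeholder features
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=200)

# Log-spaced grid for the L2 strength; refit on the full data at the end.
params = {"alpha": np.logspace(-4, 4, 17)}
search = GridSearchCV(
    Ridge(),
    params,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)                      # lambda minimizing CV error
```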
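For exercise 4a, a percentile-interval sketch; which two resamples come first depends entirely on the seed, so the pair printed here is just one legitimate draw:

```python
import numpy as np

data = np.array([2, 3, 5, 7, 11])
rng = np.random.default_rng(0)        # arbitrary seed

B = 10_000                            # number of bootstrap resamples
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(B)
])

print(boot_means[:2])                 # means of the first two resamples

# 95% percentile interval for the mean: 2.5th and 97.5th percentiles
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(lo, hi)
```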
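For exercise 4b, the round works out to clean fractions: ε = 1/3, α = ½ ln 2 ≈ 0.3466, and the normalized next-round distribution (1/4, 1/2, 1/4). A sketch of the update:

```python
import math

# One AdaBoost round: three points with uniform weights; the weak
# learner misclassifies only the second point.
w = [1/3, 1/3, 1/3]
miscl = [False, True, False]

eps = sum(wi for wi, m in zip(w, miscl) if m)    # weighted error = 1/3
alpha = 0.5 * math.log((1 - eps) / eps)          # 0.5 * ln 2 ~ 0.3466

# Errors are upweighted by e^{+alpha}; correct points downweighted by e^{-alpha}.
unnorm = [wi * math.exp(alpha if m else -alpha) for wi, m in zip(w, miscl)]
Z = sum(unnorm)                                  # normalizer ~ 0.9428
dist = [u / Z for u in unnorm]
print(unnorm)                                    # ~ [0.2357, 0.4714, 0.2357]
print(dist)                                      # ~ [0.25, 0.5, 0.25]
```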