Adam vs SGD Optimization

What's being tested

Interviewers are probing whether you can reason about optimizer choice and configuration as an engineering tradeoff: convergence speed, stability, generalization, and operational cost in production training pipelines. They want an ML Engineer who can pick and tune an optimizer for large models, explain why a choice was made, and surface implementation and reproducibility pitfalls that affect deployment and monitoring.

Core knowledge

Stochastic Gradient Descent (SGD) update: $\theta_{t+1}=\theta_t-\eta \nabla_\theta L(\theta_t)$; with momentum: $v_t=\mu v_{t-1}+\nabla_\theta L(\theta_t)$, $\theta_{t+1}=\theta_t-\eta v_t$. Momentum damps oscillation and accelerates along directions where gradients are consistent.
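Adam maintains exponential moving averages of the gradient $g_t$ and of its square: $m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t$, $v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2$, with bias-corrected $\hat m_t=m_t/(1-\beta_1^t)$, $\hat v_t=v_t/(1-\beta_2^t)$ and update $\theta_{t+1}=\theta_t-\eta \frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}$. Default betas are (0.9, 0.999). Note that $v_t$ here is the second-moment estimate, not the SGD momentum buffer.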
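AdamW (decoupled weight decay) separates weight decay from the gradient-based update: the decay is applied directly to the weights as $\theta\leftarrow\theta(1-\eta\lambda)$ per step, not as an L2 term added to the gradients. With plain Adam, an L2 gradient term is divided by $\sqrt{\hat v_t}$ like every other gradient component, so parameters with large gradient history are regularized less; decoupling removes that interaction.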
Generalization vs convergence: adaptive methods (Adam) often converge faster on training loss and are robust to the learning-rate scale, but well-tuned SGD+momentum frequently generalizes better on vision tasks such as convolutional image classification. Treat this as a tendency to verify on your own validation data, not a guarantee.
Learning-rate schedules: common choices are linear warmup, cosine decay, step decay, and polynomial decay. Warmup (a few hundred to a few thousand steps) prevents early instability with large models and Adam-based pretraining; step decay is often used for SGD fine-tuning.
Batch size scaling: when increasing batch size, the linear scaling rule sets $\eta \propto$ batch_size (usually combined with warmup); at very large batch sizes the rule breaks down and generalization can degrade unless you change the optimizer (e.g., LARS/LAMB, which use layer-wise adaptive learning rates).
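Gradient clipping & mixed precision: clip-by-norm stabilizes training; mixed precision with float16 saves memory and speeds up training but requires loss scaling so that small gradients do not underflow to zero. Unscale gradients before clipping and before the optimizer step, and keep Adam's moment estimates in float32 where possible.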
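Distributed training & optimizer state: adaptive optimizers store per-parameter state (2x the parameter count for Adam: $m,v$). This increases memory use and checkpoint size. In standard data-parallel training only gradients are all-reduced, so the optimizer choice does not change communication volume; gradient accumulation reduces communication frequency, and optimizer state sharding reduces per-device memory.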
Checkpointing & reproducibility: save model weights, optimizer state (including step count), and LR-scheduler state to resume the same trajectory; exact reproduction also needs the RNG and data-loader position. A mismatched LR schedule or missing state yields different behavior.
Hyperparameter sensitivity: Adam is less sensitive to initial $\eta$ ; SGD requires careful LR tuning and schedule. For both, tune weight decay and batch size jointly; monitor validation metrics, not just training loss.
When to prefer which: use Adam for fast prototyping, unstable gradients, sparse/embedding-heavy models, and large transformer pretraining; use SGD+momentum for long fine-tuning runs where final generalization is the priority, especially for computer-vision backbones and large datasets.
Practical knobs and defaults: Adam's epsilon, commonly $\epsilon\in[10^{-8},10^{-6}]$, affects stability (larger values damp the update when $\hat v_t$ is small); for PyTorch transformers, many use AdamW with warmup and cosine decay; for large-batch training in TensorFlow, consider LAMB.
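
To make the SGD+momentum and Adam update rules above concrete, here is a minimal sketch in plain Python. The function names (`sgd_momentum_step`, `adam_step`) and the toy gradient values are illustrative assumptions, not any library's API; real training would use a framework optimizer.

```python
import math

def sgd_momentum_step(theta, grad, velocity, lr=0.1, mu=0.9):
    """One SGD+momentum step: v_t = mu*v_{t-1} + g_t; theta_{t+1} = theta_t - lr*v_t."""
    velocity = [mu * v + g for v, g in zip(velocity, grad)]
    theta = [p - lr * v for p, v in zip(theta, velocity)]
    return theta, velocity

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias correction; t is the 1-based step count."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Hypothetical badly scaled gradient: one coordinate is 100x larger than the other.
theta0 = [1.0, 1.0]
g0 = [100.0, 1.0]

sgd_theta, _ = sgd_momentum_step(theta0, g0, [0.0, 0.0], lr=0.01)
adam_theta, _, _ = adam_step(theta0, g0, [0.0, 0.0], [0.0, 0.0], t=1, lr=0.01)

# SGD moves the first coordinate 100x further than the second;
# Adam's first step moves every coordinate by roughly lr, whatever the gradient scale.
print("SGD :", sgd_theta)
print("Adam:", adam_theta)
```

The per-coordinate normalization is what makes Adam tolerant of the learning-rate scale, and it is also why an L2 term folded into the gradient behaves differently under Adam than under SGD.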

Worked example — Explain self-attention, LoRA, Adam vs SGD, ViT

First 30s framing: ask which part of the pipeline matters (pretraining a Vision Transformer vs fine-tuning it), what the dataset and compute limits are, what the target metric is (validation accuracy vs a perceptual metric), and whether parameter-efficient updates (like LoRA) are allowed. Skeleton of a strong answer: (1) briefly explain the mechanistic differences between Adam and SGD+momentum; (2) map those mechanics to the ViT lifecycle, pretraining vs fine-tuning; (3) give concrete config recommendations (AdamW + warmup + cosine decay for pretraining; SGD+momentum + warm restarts or step decay for fine-tuning); and (4) add deployment/operational notes (checkpointing optimizer state, mixed precision). Key tradeoff to call out: Adam speeds up early convergence and tolerates a wider range of learning rates, but it can converge to sharper minima that hurt downstream generalization; SGD often needs longer training but tends to reach flatter minima. Close by stating next steps: run a small ablation (AdamW vs SGD, compared on a held-out split), sweep LR and weight decay, and, if time permits, measure sharpness (e.g., an approximate Hessian trace) or validate on a production-like eval set.
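
The "warmup + cosine" recommendation for pretraining can be sketched as a small schedule function. This is a minimal illustration in plain Python; the name `warmup_cosine_lr`, its parameters, and the step counts in the demo are assumptions chosen for the example, not a specific framework's scheduler.

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr by total_steps.

    `step` is 0-based. Steps past total_steps stay at min_lr.
    """
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 1,000 warmup steps out of 10,000 total, peak LR 3e-4.
for s in (0, 500, 999, 1000, 5500, 10000):
    print(s, warmup_cosine_lr(s, base_lr=3e-4, warmup_steps=1000, total_steps=10000))
```

In an interview, it is worth saying that the warmup length and peak LR are the two values to sweep first; the cosine tail is usually less sensitive.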

A second angle

Consider a large-scale recommendation model with sparse embedding tables and heavy categorical features. The same optimizer tradeoffs apply, but the constraints shift: embedding rows are updated sparsely, which favors Adam/Adagrad variants that adapt per-parameter learning rates and so improve convergence on rarely updated embeddings. However, per-embedding optimizer state multiplies memory; you may use plain SGD (no momentum, hence no optimizer state) for embeddings or shard optimizer state across parameter servers. Also, for recommender training on streaming data, latency and checkpoint frequency matter: prefer optimizers that tolerate stale gradients, or switch from Adam to SGD partway through training to combine fast early convergence with better long-term generalization.
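
A quick back-of-the-envelope calculation shows why optimizer state dominates memory for embedding tables. The sketch below assumes dense per-row state stored at the same precision as the weights; the helper name `embedding_memory_bytes` and the table size are hypothetical.

```python
# Number of extra per-parameter values each optimizer keeps.
STATE_SLOTS = {"sgd": 0, "sgd_momentum": 1, "adagrad": 1, "adam": 2}

def embedding_memory_bytes(num_rows, dim, optimizer, bytes_per_value=4):
    """Return (parameter_bytes, optimizer_state_bytes) for one embedding table."""
    param_bytes = num_rows * dim * bytes_per_value
    state_bytes = STATE_SLOTS[optimizer] * param_bytes
    return param_bytes, state_bytes

# Hypothetical table: 100M rows x 64 dimensions in float32.
for name in STATE_SLOTS:
    params, state = embedding_memory_bytes(100_000_000, 64, name)
    print(f"{name:13s} params={params / 1e9:.1f} GB  state={state / 1e9:.1f} GB")
```

Under these assumptions Adam triples the footprint of the table relative to plain SGD, which is the argument for stateless embeddings or sharded optimizer state.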

Common pitfalls

Pitfall: Treating L2 regularization and weight decay as interchangeable.

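L2 added to the loss is not identical to decoupled weight decay under Adam: the L2 gradient term is rescaled by Adam's adaptive denominator, so the effective regularization changes per parameter and interacts with the learning rate. For plain SGD the two forms coincide up to a rescaling of the coefficient, but with Adam prefer AdamW-style decoupled decay over an L2 term in the loss.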
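
The difference is easy to see on a single scalar weight. The following is a minimal sketch, assuming a one-parameter Adam implementation written for this example (`adam_update` is a hypothetical helper, not a library function); the loss gradient is set to zero so that only the regularizer acts.

```python
import math

def adam_update(theta, grad, state, lr, weight_decay, decoupled,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam step with either L2-in-gradient or decoupled weight decay."""
    state["t"] += 1
    if decoupled:
        # AdamW style: shrink the weight directly, outside the adaptive update.
        theta = theta * (1.0 - lr * weight_decay)
    else:
        # L2 style: fold the penalty into the gradient, so Adam normalizes it.
        grad = grad + weight_decay * theta
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

def fresh_state():
    return {"m": 0.0, "v": 0.0, "t": 0}

# With zero loss gradient, compare one step for two decay strengths.
for wd in (0.1, 0.5):
    l2 = adam_update(1.0, 0.0, fresh_state(), lr=0.01, weight_decay=wd, decoupled=False)
    dec = adam_update(1.0, 0.0, fresh_state(), lr=0.01, weight_decay=wd, decoupled=True)
    print(f"weight_decay={wd}: L2-in-gradient -> {l2:.6f}, decoupled -> {dec:.6f}")
```

On this first step the L2 variant shrinks the weight by about `lr` no matter what `weight_decay` is, because Adam normalizes the penalty away, while the decoupled variant shrinks it in proportion to `lr * weight_decay`.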

Pitfall: Relying solely on training loss to compare optimizers.

Adam often reaches a lower training loss faster; if you don't check validation/generalization, you may pick an optimizer that underperforms in production. Report downstream metrics and out-of-sample performance.

Pitfall: Forgetting to save optimizer state and step counters in checkpoints.

Resuming without optimizer state or with mismatched LR schedule produces a different optimization trajectory and invalidates comparisons or retraining experiments.
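
A toy run illustrates the effect. This sketch uses a hand-written SGD+momentum loop on the loss $0.5\,\theta^2$ and a JSON string as a stand-in for a checkpoint file; the function names and the checkpoint layout are assumptions for the example, and the learning rate is constant, so the saved step counter is carried along only to show where schedule state would live.

```python
import json

def grad(theta):
    """Gradient of the toy loss 0.5 * theta**2."""
    return theta

def train(ckpt, num_steps, lr=0.1, mu=0.9):
    """Run SGD+momentum from a checkpoint dict and return the new checkpoint."""
    theta, velocity, step = ckpt["theta"], ckpt["velocity"], ckpt["step"]
    for _ in range(num_steps):
        velocity = mu * velocity + grad(theta)
        theta = theta - lr * velocity
        step += 1
    return {"theta": theta, "velocity": velocity, "step": step}

start = {"theta": 1.0, "velocity": 0.0, "step": 0}

# Reference: 10 uninterrupted steps.
uninterrupted = train(start, 10)

# Interrupted run: 5 steps, save everything, reload, 5 more steps.
midpoint = train(start, 5)
saved = json.dumps(midpoint)
resumed = train(json.loads(saved), 5)

# Same interruption, but only the weight is restored (momentum buffer reset).
weights_only = {"theta": midpoint["theta"], "velocity": 0.0, "step": 0}
resumed_without_state = train(weights_only, 5)

print("uninterrupted        :", uninterrupted["theta"])
print("resumed (full state) :", resumed["theta"])
print("resumed (weights only):", resumed_without_state["theta"])
```

The full-state resume reproduces the reference trajectory, while the weights-only resume ends somewhere else, which is why comparisons across such runs are not apples to apples.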

Connections

Interviewers may pivot to learning-rate scheduling (warmup, cosine), large-batch optimizers like LARS/LAMB, or to distributed training topics such as gradient accumulation, all-reduce strategies, and optimizer state sharding. They may also ask about regularization interactions (batchnorm, dropout) and transfer-learning practices.
