Adam vs SGD Optimization
Asked of: Machine Learning Engineer
What's being tested
Interviewers are probing whether you can reason about optimizer choice and configuration as an engineering tradeoff: convergence speed, stability, generalization, and operational cost in production training pipelines. They want an ML Engineer who can pick and tune an optimizer for large models, explain why a choice was made, and surface implementation and reproducibility pitfalls that affect deployment and monitoring.
Core knowledge
-
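Stochastic Gradient Descent (SGD) update: $\theta_{t+1} = \theta_t - \eta g_t$; with momentum: $u_{t+1} = \mu u_t + g_t$, $\theta_{t+1} = \theta_t - \eta u_{t+1}$. Here $g_t$ is the minibatch gradient, $\eta$ the learning rate, and $\mu$ the momentum coefficient. Momentum reduces oscillation and accelerates along consistent gradients.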
-
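Adam maintains adaptive first- and second-moment estimates: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, with bias-corrected $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$ and update $\theta_{t+1} = \theta_t - \eta \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$. Default betas are $(\beta_1, \beta_2) = (0.9, 0.999)$.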
-
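AdamW (decoupled weight decay) separates weight decay from the gradient-based update: the decay is applied directly to the weights as $\theta \leftarrow \theta - \eta \lambda \theta$ per step, not via an L2 term added to the gradients. This keeps the regularization from being rescaled by Adam's adaptive denominator and makes the best weight-decay value less entangled with the learning rate.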
-
Generalization vs convergence: adaptive methods (Adam) often converge faster on training loss and are robust to learning-rate scale, but SGD+momentum frequently yields better generalization on vision and recommendation tasks when properly tuned.
-
Learning-rate schedules: common choices are linear warmup, cosine decay, step decay, and polynomial decay. Warmup (a few hundred to a few thousand steps) prevents early instability with large models and Adam pretraining; step decay is often used for SGD fine-tuning.
-
Batch size scaling: when increasing batch size, apply the linear scaling rule and scale the learning rate in proportion to batch size ($\eta \propto$ batch_size). In the very-large-batch regime, generalization can degrade unless you change the optimizer (e.g., LARS/LAMB for huge batches).
-
Gradient clipping & mixed precision: clip-by-norm stabilizes training; mixed precision with float16 saves memory and speeds up training but requires loss scaling so that small gradients do not underflow to zero, which would otherwise corrupt Adam's second-moment estimates.
-
Distributed training & optimizer state: adaptive optimizers store per-parameter state (for Adam, 2x the parameter count: $m$ and $v$). This increases memory footprint and checkpoint size; consider sharding optimizer state across workers, and use gradient accumulation to reduce communication frequency.
-
Checkpointing & reproducibility: save both model and optimizer state (including step count) to resume exact trajectories; mismatched LR schedule or missing state yields different behavior.
-
Hyperparameter sensitivity: Adam is less sensitive to the initial learning rate $\eta$; SGD requires careful LR tuning and a well-chosen schedule. For both, tune weight decay and batch size jointly; monitor validation metrics, not just training loss.
-
When to prefer which: use Adam for fast prototyping, unstable gradients, sparse/embeddings-heavy models, and large transformer pretraining; use SGD+momentum for long finetuning runs where final generalization is the priority, especially for computer-vision backbones and large datasets.
-
Practical knobs and defaults: Adam's epsilon (default $10^{-8}$ in the original paper) affects stability; for PyTorch transformers, many use AdamW with warmup and cosine decay; for TensorFlow large-scale training, consider LAMB for huge batches.
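The SGD+momentum and Adam update rules from the list above can be sketched on a single scalar parameter. This is a minimal illustration, not a library implementation: the function names `sgd_momentum_step` and `adam_step`, the toy quadratic loss, and every hyperparameter value are assumptions chosen for the example.

```python
import math

def sgd_momentum_step(theta, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum step: accumulate velocity, then move against it."""
    velocity = momentum * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step at 1-indexed step t, with bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def grad_fn(theta):
    # Gradient of the toy loss f(theta) = theta^2 (an assumption for this demo)
    return 2.0 * theta

theta_sgd, vel = 5.0, 0.0
theta_adam, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    theta_sgd, vel = sgd_momentum_step(theta_sgd, grad_fn(theta_sgd), vel, lr=0.01)
    theta_adam, m, v = adam_step(theta_adam, grad_fn(theta_adam), m, v, t)
```

One property worth pointing out in an interview: on the first step Adam moves by roughly `lr` whether the gradient is 10 or 1000, because the update is normalized by the second-moment estimate. That normalization is the mechanical reason Adam is less sensitive to gradient and learning-rate scale than SGD.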
Worked example — Explain self-attention, LoRA, Adam vs SGD, ViT
First 30s framing: ask which part of the pipeline matters (pretraining a Vision Transformer vs finetuning), what the dataset and compute limits are, what the target metric is (validation accuracy vs a perceptual metric), and whether parameter-efficient updates (like LoRA) are allowed.
Skeleton of a strong answer: (1) briefly explain the mechanistic differences between Adam and SGD+momentum; (2) map those mechanics to the ViT lifecycle, pretraining vs finetuning; (3) give concrete config recommendations (AdamW + warmup + cosine decay for pretraining; SGD+momentum with warm restarts or step decay for finetuning); and (4) add deployment/operational notes (checkpointing optimizer state, mixed precision).
Key tradeoff to call out: Adam speeds up early convergence and is more forgiving of the learning-rate choice, but it is often reported to converge to sharper minima that hurt downstream generalization; SGD often needs longer training but tends to reach flatter minima.
Close by stating next steps: run a small ablation (AdamW vs SGD on a held-out split), sweep LR and weight decay, and, if time permits, measure sharpness (e.g., an approximate Hessian trace) or validate on a production-like eval set.
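The "warmup + cosine" recommendation for pretraining can be made concrete with a small schedule function. This is a sketch under stated assumptions: the name `warmup_cosine_lr`, the 0-indexed step convention, and the example values (base LR of 1e-3, 100 warmup steps, 1000 total steps) are illustrative and not taken from any specific ViT recipe.

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches base_lr
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical schedule: 100 warmup steps out of 1000 total
schedule = [warmup_cosine_lr(s, 1e-3, 100, 1000) for s in range(1000)]
```

Because the schedule depends only on the step index, the step counter has to be restored along with the weights when resuming; otherwise the run re-enters warmup.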
A second angle
Consider a large-scale recommendation model with sparse embedding tables and heavy categorical features. The same optimizer tradeoffs apply, but the constraints shift. Embedding updates are sparse, which favors Adam/Adagrad variants: per-parameter learning rates improve convergence on rarely updated embeddings. However, per-embedding optimizer state multiplies memory use (Adam adds two extra values per embedding weight), so you may use stateless SGD for embeddings or shard optimizer state across parameter servers. For recommender training on streaming data, latency and checkpoint frequency also matter: prefer optimizers that tolerate stale gradients, or use periodic Adam-to-SGD transitions to combine fast convergence with better long-term generalization.
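One way to get adaptive step sizes on embeddings without paying for full per-element state is row-wise Adagrad, a variant that keeps a single accumulator per embedding row. The sketch below is a toy illustration of that idea, not the method of any particular framework; the class name `RowwiseAdagradEmbedding`, the lazy dictionary of accumulators, and the hyperparameters are all assumptions.

```python
import math

class RowwiseAdagradEmbedding:
    """Toy embedding table updated with row-wise Adagrad (one accumulator per row)."""

    def __init__(self, num_rows, dim, lr=0.1, eps=1e-8):
        self.table = [[0.0] * dim for _ in range(num_rows)]
        self.accum = {}  # row id -> accumulated mean squared gradient, allocated lazily
        self.lr = lr
        self.eps = eps

    def sparse_update(self, row_grads):
        """row_grads maps row id -> gradient vector; rows not in the batch are untouched."""
        for row, grad in row_grads.items():
            mean_sq = sum(g * g for g in grad) / len(grad)
            self.accum[row] = self.accum.get(row, 0.0) + mean_sq
            # Frequently updated rows accumulate more and therefore take smaller steps
            scale = self.lr / (math.sqrt(self.accum[row]) + self.eps)
            self.table[row] = [w - scale * g for w, g in zip(self.table[row], grad)]

emb = RowwiseAdagradEmbedding(num_rows=1000, dim=4)
emb.sparse_update({7: [1.0, 1.0, 1.0, 1.0]})                             # popular row, first hit
emb.sparse_update({7: [1.0, 1.0, 1.0, 1.0], 3: [1.0, 1.0, 1.0, 1.0]})    # rare row 3 appears
```

In this toy run the rarely seen row 3 takes a full-size first step while the repeatedly updated row 7 has already been damped, and optimizer state exists only for the two rows that were touched.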
Common pitfalls
Pitfall: Treating L2 regularization and weight decay as interchangeable.
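L2 added to the loss is not identical to the decoupled weight decay used with Adam; using the wrong form changes the effective regularization and how it interacts with the learning rate. For plain SGD the two coincide up to a rescaling by the learning rate, but with adaptive optimizers they do not. Use AdamW-style decoupled decay with adaptive optimizers rather than adding an L2 term to the loss.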
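The difference is easiest to see on a single Adam step with a large gradient. The following sketch compares the two forms on a scalar weight; the helper names (`adam_direction`, `adam_l2_step`, `adamw_step`) and the numbers are assumptions for illustration, and the decoupled step uses the common form where the decay is multiplied by the learning rate.

```python
import math

def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps) plus updated moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(theta, grad, m, v, t, lr, wd):
    # L2 form: the decay term is folded into the gradient, then rescaled adaptively
    direction, m, v = adam_direction(grad + wd * theta, m, v, t)
    return theta - lr * direction, m, v

def adamw_step(theta, grad, m, v, t, lr, wd):
    # Decoupled form: the decay acts on the weight directly, outside the adaptive rescaling
    direction, m, v = adam_direction(grad, m, v, t)
    return theta - lr * direction - lr * wd * theta, m, v

# Same weight, same large gradient, same nominal weight decay
theta_l2, _, _ = adam_l2_step(1.0, 100.0, 0.0, 0.0, 1, lr=0.01, wd=0.1)
theta_w, _, _ = adamw_step(1.0, 100.0, 0.0, 0.0, 1, lr=0.01, wd=0.1)
```

With the L2 form the weight moves by about `lr` and the decay contribution is absorbed by the normalization; with the decoupled form the weight additionally shrinks by `lr * wd * theta` regardless of gradient size.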
Pitfall: Relying solely on training loss to compare optimizers.
Adam often reaches a lower training loss faster; if you don't check validation/generalization, you may pick an optimizer that underperforms in production. Report downstream metrics and out-of-sample performance.
Pitfall: Forgetting to save optimizer state and step counters in checkpoints.
Resuming without optimizer state or with mismatched LR schedule produces a different optimization trajectory and invalidates comparisons or retraining experiments.
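A small experiment makes the checkpointing pitfall concrete. The sketch below uses a scalar Adam with explicit state; `TinyAdam`, `train`, the JSON checkpoint layout, and the toy loss are all hypothetical stand-ins for a real framework's optimizer and checkpoint format.

```python
import json
import math

class TinyAdam:
    """Scalar Adam with explicit, serializable state (m, v, and step count)."""

    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.step_count = 0.0, 0.0, 0

    def step(self, theta, grad):
        self.step_count += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.step_count)
        v_hat = self.v / (1 - self.beta2 ** self.step_count)
        return theta - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

    def state_dict(self):
        return {"m": self.m, "v": self.v, "step_count": self.step_count}

    def load_state_dict(self, state):
        self.m, self.v, self.step_count = state["m"], state["v"], state["step_count"]

def train(theta, opt, steps):
    for _ in range(steps):
        theta = opt.step(theta, 2.0 * theta)  # gradient of the toy loss theta^2
    return theta

# Reference run: 20 uninterrupted steps
reference = train(5.0, TinyAdam(), 20)

# Checkpoint after 10 steps, serialize, then resume in a fresh optimizer
opt = TinyAdam()
theta_mid = train(5.0, opt, 10)
checkpoint = json.dumps({"theta": theta_mid, "optimizer": opt.state_dict()})

restored = json.loads(checkpoint)
resumed_opt = TinyAdam()
resumed_opt.load_state_dict(restored["optimizer"])
resumed = train(restored["theta"], resumed_opt, 10)

# Restoring only the weights resets the moments and the bias-correction step count
weights_only = train(restored["theta"], TinyAdam(), 10)
```

The run resumed with full optimizer state lands on the same value as the uninterrupted reference, while the weights-only resume ends somewhere else, which is exactly why optimizer comparisons across restarts are unreliable without the state.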
Connections
Interviewers may pivot to learning-rate scheduling (warmup, cosine), large-batch optimizers like LARS/LAMB, or to distributed training topics such as gradient accumulation, all-reduce strategies, and optimizer state sharding. They may also ask about regularization interactions (batchnorm, dropout) and transfer-learning practices.
Further reading
-
Adam: A Method for Stochastic Optimization (Kingma & Ba): original paper explaining moments and bias correction.
-
Decoupled Weight Decay Regularization (Loshchilov & Hutter): introduces AdamW and explains why weight decay must be handled separately.
-
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (Goyal et al.): linear scaling rule and warmup for large-batch training.
Related concepts
- RLHF And Preference Optimization Basics
- Recommender, Ranking, And Ads ML Systems
- Supervised ML, Imbalance, Overfitting, And Optimization (Machine Learning)
- Ads Ranking And Auction-Aware ML
- Ads Revenue, Auction, And Business Tradeoffs (Analytics & Experimentation)
- Shop Ads And Shopping Measurement (Analytics & Experimentation)