Explain Transformers, activations, and training optimization
Company: DRW
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Take-home Project
Answer conceptual questions on modern deep learning:
1) Derive the scaled dot-product self-attention computation and explain why scaling is needed;
2) Compare pre-LN vs post-LN Transformer blocks and their impact on training stability;
3) Justify when to use ReLU, GELU, or SiLU and their effects on gradient flow;
4) Explain positional encodings (sinusoidal vs learned) and how they influence extrapolation;
5) Describe common regularization methods in Transformers (dropout, label smoothing) and when to apply them;
6) Discuss optimizer and learning-rate scheduling choices (AdamW, warmup, cosine decay) and their rationale;
7) Explain gradient clipping and mixed-precision training trade-offs;
8) Identify causes of training divergence and mitigation strategies;
9) Compare cross-attention and self-attention and when to use each;
10) Explain how attention masking works for causal vs bidirectional models.
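For items 1 and 10, the computation being asked about is Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so dividing by sqrt(d_k) keeps the scores at roughly unit variance and keeps the softmax out of its saturated, small-gradient regime. The NumPy sketch below is one minimal, single-head way to write this, with no batching and no learned projections. The function names and the `causal` flag are illustrative choices, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, causal=False):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v). Returns (output, weights)."""
    d_k = q.shape[-1]
    # Raw similarity scores, scaled so their variance does not grow with d_k.
    scores = q @ k.T / np.sqrt(d_k)
    if causal:
        # Causal mask: query position i may attend only to key positions j <= i.
        n_q, n_k = scores.shape
        future = np.triu(np.ones((n_q, n_k), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Self-attention: queries, keys, and values all come from the same sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x, causal=True)
```

With `causal=False` every position attends to every other, which is the bidirectional case. Passing a `q` from one sequence and `k`, `v` from another gives the cross-attention of item 9, where a causal mask is normally not applied.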
Quick Answer: This question evaluates core ML concepts, assumptions, mathematical intuition, training and evaluation trade-offs, and practical failure modes in a realistic interview setting. A strong answer states its assumptions, handles edge cases, explains trade-offs, and shows clearly how to validate the result.
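As a concrete reference for item 6 in the list above, a linear-warmup and cosine-decay learning-rate schedule could be sketched as follows. The function name and every default value (`base_lr`, `warmup_steps`, `total_steps`, `min_lr`) are hypothetical placeholders rather than recommended settings. The usual rationale is that warmup keeps early updates small while Adam-style second-moment estimates are still noisy, and cosine decay then lowers the rate smoothly toward the end of training.

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=1000,
               total_steps=100_000, min_lr=0.0):
    """Learning rate for a 0-indexed optimizer step (illustrative defaults)."""
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine factor goes from 1 at the start of decay to 0 at the end.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine

# Sample the schedule at a few points to inspect its shape.
schedule = [lr_at_step(s) for s in (0, 500, 1000, 50_000, 100_000)]
```

In practice the returned value would be written into the optimizer's learning rate before each step. Most frameworks ship an equivalent scheduler, so this is only meant to make the shape of the curve explicit.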