Modern Deep Learning: Conceptual Questions (ML Engineer Take-home)
You are preparing for a Machine Learning Engineer take-home. Answer the following conceptual questions concisely but precisely, providing formulas and key intuitions.
- Derive the scaled dot-product self-attention computation and explain why the scaling factor is needed.
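For reference while answering, here is a minimal NumPy sketch of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V for a single head with no masking; the function name and shapes are illustrative. Without the 1/sqrt(d_k) scaling, dot products grow with d_k, pushing the softmax into saturated, near-one-hot regions with vanishing gradients.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                              # convex combination of value rows
```

Note that each output row is a convex combination of the rows of V, which is a useful sanity check when debugging.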
- Compare pre-LN vs post-LN Transformer blocks and their impact on training stability.
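To make the structural difference concrete, a toy sketch of the two block orderings (the `attn`, `ffn`, and `ln` callables are placeholders for the real sublayers). Pre-LN keeps an identity residual path from input to output, which is the usual intuition for its better stability at depth without warmup.

```python
def post_ln_block(x, attn, ffn, ln1, ln2):
    """Original ordering: LayerNorm applied after each residual sum."""
    x = ln1(x + attn(x))
    x = ln2(x + ffn(x))
    return x

def pre_ln_block(x, attn, ffn, ln1, ln2):
    """Pre-LN ordering: normalize the sublayer input, keep the residual path clean."""
    x = x + attn(ln1(x))
    x = x + ffn(ln2(x))
    return x
```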
- Explain when to use ReLU, GELU, or SiLU and how each affects gradient flow.
- Explain positional encodings (sinusoidal vs learned) and how they influence extrapolation.
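A compact sketch of the sinusoidal table from the original Transformer formulation, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); it assumes an even `d_model`. Because the table is a fixed function of position, it can be evaluated at positions longer than any seen in training, unlike a learned embedding table.

```python
import numpy as np

def sinusoidal_positions(n_pos, d_model):
    """Sinusoidal positional encodings, shape (n_pos, d_model); d_model must be even."""
    pos = np.arange(n_pos)[:, None]                    # (n_pos, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2) frequency index
    angles = pos / (10000.0 ** (2 * i / d_model))      # geometric frequency progression
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims: cosine
    return pe
```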
- Describe common regularization methods in Transformers (dropout, label smoothing) and when to apply them.
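For the label-smoothing half, a minimal sketch of the smoothed target distribution: each one-hot target gives up mass `eps`, redistributed uniformly over all classes (the helper name is illustrative).

```python
import numpy as np

def smooth_labels(labels, n_classes, eps=0.1):
    """One-hot targets with eps mass spread uniformly across all classes."""
    onehot = np.eye(n_classes)[labels]          # (batch, n_classes)
    return onehot * (1.0 - eps) + eps / n_classes
```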
- Discuss optimizer and learning-rate scheduling choices (AdamW, warmup, cosine decay) and their rationale.
- Explain gradient clipping and mixed-precision training trade-offs.
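A sketch of the gradient-clipping half of this question: global-norm clipping rescales all gradients jointly so the update direction is preserved while the magnitude is bounded (the mixed-precision loss-scaling side is framework-specific and omitted here).

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(float((g * g).sum()) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-6))   # no-op if already within bound
    return [g * scale for g in grads]
```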
- Identify causes of training divergence and mitigation strategies.
- Compare cross-attention and self-attention and when to use each.
- Explain how attention masking works for causal vs bidirectional models.
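To ground the last question, a minimal sketch of causal masking: position i may attend only to positions j <= i, implemented by setting disallowed logits to a large negative value before the softmax. A bidirectional model simply uses an all-True mask (padding aside). Function names are illustrative.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) mask: True where position i may attend to position j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis with disallowed positions forced to ~zero weight."""
    scores = np.where(mask, scores, -1e9)            # kill disallowed logits
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)
```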