Explain Transformers, activations, and training optimization

This is a Machine Learning interview question from DRW for Machine Learning Engineer roles.

Question

Modern Deep Learning: Conceptual Questions (ML Engineer Take-home)

You are preparing for a Machine Learning Engineer take-home. Answer the following conceptual questions concisely but precisely, providing formulas and key intuitions.

1. Derive the scaled dot-product self-attention computation and explain why scaling is needed.
2. Compare pre-LN vs post-LN Transformer blocks and their impact on training stability.
3. Justify when to use ReLU, GELU, or SiLU and their effects on gradient flow.
4. Explain positional encodings (sinusoidal vs learned) and how they influence extrapolation.
5. Describe common regularization methods in Transformers (dropout, label smoothing) and when to apply them.
6. Discuss optimizer and learning-rate scheduling choices (AdamW, warmup, cosine decay) and their rationale.
7. Explain gradient clipping and mixed-precision training trade-offs.
8. Identify causes of training divergence and mitigation strategies.
9. Compare cross-attention and self-attention and when to use each.
10. Explain how attention masking works for causal vs bidirectional models.
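
As a reference point for the attention and masking questions above, here is a minimal NumPy sketch of scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, with an optional causal mask. It is an illustration under stated assumptions, not the site's locked solution: the function name scaled_dot_product_attention, the causal flag, and the single-head, unbatched shapes are all choices made for this example.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v, causal=False):
    """Single-head attention sketch (hypothetical helper, not a library API).

    q: (n_queries, d_k), k: (n_keys, d_k), v: (n_keys, d_v).
    Returns (output, weights) with shapes (n_queries, d_v) and (n_queries, n_keys).
    """
    d_k = q.shape[-1]
    # Dividing by sqrt(d_k) keeps the score variance near 1 as d_k grows,
    # so the softmax is less likely to saturate and starve the gradients.
    scores = q @ k.T / np.sqrt(d_k)
    if causal:
        # Causal (decoder-style) mask: query i may only see keys j <= i.
        # A bidirectional (encoder-style) model simply skips this step.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Self-attention: queries, keys, and values all come from the same sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x, causal=True)
```

With causal=True the first query can attend only to the first key, so the first output row equals the first row of v. For cross-attention, q would come from one sequence and k, v from another, with causal left as False.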