
Explain Transformers, activations, and training optimization

Last updated: Mar 29, 2026

Quick Overview

This question evaluates understanding of Transformer architectures, activation functions, optimizer and learning-rate scheduling choices, regularization, and training stability, as expected of Machine Learning Engineer candidates.


Company: DRW

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: Hard

Interview Round: Take-home Project


Modern Deep Learning: Conceptual Questions (ML Engineer Take-home)

You are preparing for a Machine Learning Engineer take-home. Answer the following conceptual questions concisely but precisely, providing formulas and key intuitions. Illustrative reference sketches for several of the items follow the list.

  1. Derive the scaled dot-product self-attention computation and explain why scaling is needed.
  2. Compare pre-LN vs post-LN Transformer blocks and their impact on training stability.
  3. Justify when to use ReLU, GELU, or SiLU and their effects on gradient flow.
  4. Explain positional encodings (sinusoidal vs learned) and how they influence extrapolation.
  5. Describe common regularization methods in Transformers (dropout, label smoothing) and when to apply them.
  6. Discuss optimizer and learning-rate scheduling choices (AdamW, warmup, cosine decay) and their rationale.
  7. Explain gradient clipping and mixed-precision training trade-offs.
  8. Identify causes of training divergence and mitigation strategies.
  9. Compare cross-attention and self-attention and when to use each.
  10. Explain how attention masking works for causal vs bidirectional models.
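
Reference Sketches

The notes below are illustrative sketches for several of the questions above, not model answers.

For question 1, the scaled dot-product attention computation, with the standard variance argument for the scaling factor:

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V,
    \qquad Q \in \mathbb{R}^{n \times d_k},\ K \in \mathbb{R}^{m \times d_k},\ V \in \mathbb{R}^{m \times d_v}

If the components of a query q and a key k are independent with zero mean and unit variance, then Var(q · k) = d_k, so raw logits grow with the key dimension. Dividing by sqrt(d_k) restores unit variance, keeping the softmax out of its saturated regime, where gradients vanish.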
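
For question 2, a minimal PyTorch sketch contrasting the two residual orderings (module structure is illustrative; the feed-forward sublayer is omitted for brevity):

    import torch.nn as nn

    class PostLN(nn.Module):
        """Post-LN (original Transformer): normalize after the residual add.
        Every layer's gradient passes through a LayerNorm, so deep stacks
        are sensitive at initialization and usually need LR warmup."""
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            a, _ = self.attn(x, x, x)
            return self.norm(x + a)      # norm(x + sublayer(x))

    class PreLN(nn.Module):
        """Pre-LN: normalize inside the residual branch. The skip path is an
        identity, so gradients reach early layers unattenuated; training is
        typically stable without warmup, sometimes at a small quality cost."""
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            h = self.norm(x)
            a, _ = self.attn(h, h, h)
            return x + a                 # x + sublayer(norm(x))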
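
For question 3, the three activations written out; the comments summarize the gradient-flow trade-offs:

    import math
    import torch

    def relu(x):
        # max(0, x): cheap and sparse, but the gradient is exactly zero for
        # x < 0, so units can "die" if they settle into the negative regime.
        return torch.clamp(x, min=0.0)

    def gelu(x):
        # x * Phi(x), Phi the standard normal CDF: smooth, keeps a small
        # gradient for modest negative inputs; the default in BERT/GPT.
        return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

    def silu(x):
        # x * sigmoid(x) (a.k.a. swish): smooth and non-monotonic, behaves
        # much like GELU; used in SwiGLU-style feed-forward layers.
        return x * torch.sigmoid(x)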
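
For question 4, a sketch of the sinusoidal table from "Attention Is All You Need". Because it is a fixed function of position, it can be evaluated beyond the training length, whereas a learned embedding table has no rows for unseen positions:

    import torch

    def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
        """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same).
        Assumes d_model is even."""
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        i = torch.arange(0, d_model, 2, dtype=torch.float32)
        angles = pos / torch.pow(10000.0, i / d_model)   # (max_len, d_model/2)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(angles)
        pe[:, 1::2] = torch.cos(angles)
        return pe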
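
For question 5, both regularizers are one-liners in current PyTorch (the values shown are common defaults, not prescriptions):

    import torch.nn as nn

    # Label smoothing: move epsilon of target mass onto non-target classes,
    # which stops the model chasing infinite logits and improves calibration.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    # Dropout is typically applied to attention weights, residual-branch
    # outputs, and embeddings; lower or disable it when data is plentiful.
    attn = nn.MultiheadAttention(512, 8, dropout=0.1, batch_first=True)
    resid_drop = nn.Dropout(p=0.1)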
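
For question 6, a sketch of the standard AdamW recipe with linear warmup into cosine decay (hyperparameters are illustrative):

    import math
    import torch

    model = torch.nn.Linear(512, 512)    # stand-in model
    # AdamW decouples weight decay from the adaptive gradient update,
    # unlike L2 regularization folded into Adam's gradients.
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

    def lr_scale(step, warmup=1_000, total=100_000):
        # Warmup avoids large updates while Adam's second-moment estimates
        # are still noisy; cosine decay then anneals smoothly toward zero.
        if step < warmup:
            return step / max(1, warmup)
        progress = (step - warmup) / max(1, total - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_scale)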
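
For question 7, the usual mixed-precision step with global-norm clipping (torch.cuda.amp API; assumes a CUDA device). Half precision roughly halves memory and speeds up matmuls at the cost of a narrower representable range, which the loss scaler compensates for:

    import torch

    scaler = torch.cuda.amp.GradScaler()

    def train_step(model, batch, targets, opt, loss_fn, max_norm=1.0):
        opt.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():            # fp16 forward math
            loss = loss_fn(model(batch), targets)
        scaler.scale(loss).backward()              # scale up to avoid underflow
        scaler.unscale_(opt)                       # back to true gradient scale
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        scaler.step(opt)                           # skipped if grads overflowed
        scaler.update()                            # adapt the loss scale
        return loss.item()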
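
For question 9, the only difference is where Q, K, and V come from (one shared module here purely for illustration; real encoder-decoders use separate modules per sublayer):

    import torch.nn as nn

    attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

    def self_attn(x):
        # Q, K, V all from one sequence: tokens mix with each other,
        # e.g. every layer of an encoder or a decoder-only LM.
        out, _ = attn(x, x, x)
        return out

    def cross_attn(dec_x, enc_out):
        # Q from the decoder, K/V from the encoder: the decoder reads an
        # external representation (source sentence, image patches, ...).
        out, _ = attn(dec_x, enc_out, enc_out)
        return out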
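
For question 10, a sketch of additive causal masking; entries set to -inf receive zero weight after the softmax, so position i attends only to positions <= i. Bidirectional (BERT-style) models omit this mask and mask only padding tokens:

    import torch

    def causal_mask(n):
        # -inf strictly above the diagonal, 0 on and below it.
        return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

    def causal_attention(q, k, v):
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (..., n, n)
        scores = scores + causal_mask(q.size(-2))      # broadcasts over batch
        return torch.softmax(scores, dim=-1) @ v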
