PracHub

Explain Transformer Layers and FFN Rationale

Last updated: Jun 15, 2026

Quick Overview

This question evaluates a candidate's understanding of the Transformer architecture: multi-head self-attention, residual connections and layer normalization (Pre-LN vs. Post-LN), and the position-wise feed-forward network and why it is needed. It also tests tensor-shape reasoning through Q/K/V projections and attention, derivation of the O(n²d + nd²) complexity, and implementation trade-offs such as MQA/GQA, positional-encoding schemes, and normalization choices.
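As a rough sketch of where the two terms in O(n²d + nd²) come from, the count below tallies multiply-adds for one self-attention layer. It assumes sequence length n, model dimension d, and h heads of size d_k = d/h (symbols chosen here for illustration), and it ignores softmax, biases, and the FFN.

```latex
\begin{align*}
\text{Q, K, V and output projections:} &\quad 4 \cdot n d^2 \\
\text{scores } QK^\top \text{ over } h \text{ heads:} &\quad h \cdot n^2 d_k = n^2 d \\
\text{weighted sum } \mathrm{softmax}(\cdot)\,V \text{ over } h \text{ heads:} &\quad h \cdot n^2 d_k = n^2 d \\
\text{total:} &\quad 4 n d^2 + 2 n^2 d = O(n^2 d + n d^2)
\end{align*}
```

If the score matrices are materialized, they hold h·n² entries per layer, so activation memory grows quadratically in n and tends to dominate for long sequences.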


Company: UiPath

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen


Aug 7, 2025, 12:00 AM
Question

Explain the Transformer architecture in detail, then walk through the math step by step.

  1. Describe the components of each encoder and decoder layer: multi-head (self-)attention, residual ("Add") connections, layer normalization, and the position-wise feed-forward network (FFN). How do residual connections and layer normalization interact (Pre-LN vs. Post-LN)?
  2. Why is a position-wise FFN needed after attention? What does it add that attention alone cannot provide?
  3. Walk through the vector/matrix computations with shapes: the Q/K/V projections, attention-score scaling and softmax, the weighted sum that forms the context vectors, concatenation of heads, output projection, residual pathways, layer norms, and the FFN's two linear layers with activation. Use a concrete config (e.g. `d_model = 512`, `h = 8`, sequence length `n`) and give shapes for Q, K, V, the attention scores, and the block output.
  4. Derive the computational complexity of self-attention with respect to sequence length `n` and model dimension `d`, and note where memory dominates.
  5. Discuss common implementation choices and trade-offs: Q/K/V projection layout (separate vs. fused, MQA/GQA), number of heads vs. per-head dimension, positional-encoding schemes (sinusoidal, learned, relative, RoPE, ALiBi), and FFN/normalization variants (GELU/SwiGLU, RMSNorm).
  6. (Optional) Compare encoder vs. decoder layers, and describe how representations evolve across the stack of layers.
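The shape walkthrough asked for in items 1, 3, and 6 can be sketched in NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: it uses the Pre-LN ordering, separate (unfused) Q/K/V projections, a ReLU FFN with `d_ff = 2048`, and omits biases, dropout, and the learned LayerNorm gain/bias. The names `transformer_block`, `init_params`, and the `W_*` keys are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector over its features (learned gain/bias omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(s):
    # Row-wise softmax, shifted by the row max for numerical stability.
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(d_model, d_ff, rng):
    # Hypothetical small random weights, for shape checking only.
    shapes = {
        "W_q": (d_model, d_model), "W_k": (d_model, d_model),
        "W_v": (d_model, d_model), "W_o": (d_model, d_model),
        "W_1": (d_model, d_ff), "W_2": (d_ff, d_model),
    }
    return {name: rng.normal(0.0, 0.02, size=shape) for name, shape in shapes.items()}

def transformer_block(x, p, h, causal=False):
    n, d_model = x.shape
    d_k = d_model // h  # per-head dimension, 64 for d_model=512, h=8

    def split_heads(t):
        # (n, d_model) -> (h, n, d_k)
        return t.reshape(n, h, d_k).transpose(1, 0, 2)

    # Attention sublayer, Pre-LN: normalize, apply sublayer, then add the residual.
    y = layer_norm(x)
    q, k, v = (split_heads(y @ p[w]) for w in ("W_q", "W_k", "W_v"))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)       # (h, n, n)
    if causal:
        # Decoder-style mask: position i may not attend to positions j > i.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)                               # (h, n, n), rows sum to 1
    context = weights @ v                                   # (h, n, d_k)
    concat = context.transpose(1, 0, 2).reshape(n, d_model) # (n, d_model)
    x = x + concat @ p["W_o"]                               # residual add

    # Position-wise FFN sublayer: the same two-layer map applied to every token.
    hidden = np.maximum(0.0, layer_norm(x) @ p["W_1"])      # (n, d_ff), ReLU
    x = x + hidden @ p["W_2"]                               # (n, d_model)
    return x, weights

rng = np.random.default_rng(0)
n, d_model, h, d_ff = 10, 512, 8, 2048
params = init_params(d_model, d_ff, rng)
x = rng.normal(size=(n, d_model))
out, weights = transformer_block(x, params, h)
_, causal_weights = transformer_block(x, params, h, causal=True)
print(out.shape, weights.shape)
```

With `n = 10`, Q, K, and V are each `(8, 10, 64)` after the head split, the attention scores are `(8, 10, 10)`, and the block output is `(10, 512)`, matching the input. A Post-LN variant would instead compute `layer_norm(x + sublayer(x))` for each sublayer. Note that each context vector is a convex combination of value vectors, while the FFN applies a nonlinear transform to each position independently, which is one way to frame the answer to item 2.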

