
Explain attention and Transformers

Last updated: Mar 29, 2026

Quick Overview

This question tests your understanding of scaled dot-product self-attention, the Transformer encoder–decoder architecture (positional encodings, residual connections, and normalization), and the differences between BERT and GPT in pretraining and usage. It also probes attention math, masking, time and memory complexity, and the trade-offs these choices create for transfer learning and inference.


Company: Amazon

Role: Software Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

Derive the scaled dot-product self-attention: define Q, K, V, the scaling factor, masking, and softmax; discuss complexity. Explain the Transformer encoder/decoder architecture, residual connections, normalization strategy, and positional encoding. Compare BERT and GPT pretraining objectives, architectures, and typical downstream usage; discuss how these choices affect transfer and inference.


Jul 15, 2025, 12:00 AM

Scaled Dot-Product Self-Attention, Transformer Architecture, and BERT vs GPT

You are interviewing for a software engineer role focused on machine learning. Explain the core math and design choices behind Transformers and how they translate to practical trade-offs in transfer learning and inference.

1) Scaled Dot-Product Self-Attention

Derive and define the following:

  • Queries (Q), Keys (K), Values (V) and how they are computed from inputs
  • The scaling factor and why it is needed
  • Masking (padding and causal)
  • Softmax over attention logits
  • Time and memory complexity (including multi-head and autoregressive decoding)
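As a reference point for the items above, here is a minimal single-head sketch of softmax(QK^T / sqrt(d_k)) V in plain Python. The function names (`softmax`, `attention`) and the `key_padding` argument are illustrative choices, not part of the question, and the code assumes every query has at least one unmasked key. It materializes the full n x m weight matrix, which is where the quadratic time and memory cost in sequence length comes from.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False, key_padding=None):
    """Scaled dot-product attention for a single head (illustrative sketch).

    Q: n x d_k, K: m x d_k, V: m x d_v, given as lists of lists.
    key_padding: optional list of m booleans, True where the key is padding.
    Assumes each query row keeps at least one unmasked key.
    """
    d_k = len(Q[0])
    scale = 1.0 / math.sqrt(d_k)  # keeps logit variance roughly independent of d_k
    out, weights = [], []
    for i, q in enumerate(Q):
        logits = []
        for j, k in enumerate(K):
            score = scale * sum(a * b for a, b in zip(q, k))
            # Causal mask hides future positions; padding mask hides pad tokens.
            masked = (causal and j > i) or (key_padding is not None and key_padding[j])
            logits.append(-math.inf if masked else score)
        w = softmax(logits)  # masked entries get exactly zero weight
        weights.append(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out, weights

# Usage: with a causal mask, the first position can only attend to itself.
X = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(X, X, X, causal=True)
```

In a real model Q, K, and V would come from learned linear projections of the input, and multi-head attention would run this per head on d_model / h dimensional slices; both are omitted here to keep the sketch short.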

2) Transformer Architecture

Explain the encoder–decoder Transformer architecture, including:

  • Encoder vs decoder stacks and their sublayers
  • Residual connections and the normalization strategy (pre-norm vs post-norm)
  • Positional encoding (sinusoidal and alternatives)
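Two of the items above can be sketched compactly: the sinusoidal positional encoding and the difference between post-norm and pre-norm residual blocks. This is a simplified illustration in plain Python. The helper names are hypothetical, `layer_norm` omits the learned gain and bias, and `sublayer` stands in for either attention or the feed-forward network.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)); PE[pos][2i+1] = cos(same angle)."""
    pe = []
    for pos in range(num_positions):
        row = []
        for idx in range(d_model):
            angle = pos / (10000 ** ((2 * (idx // 2)) / d_model))
            row.append(math.sin(angle) if idx % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    """Post-norm: LayerNorm(x + Sublayer(x)); normalization sits on the residual path."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    """Pre-norm: x + Sublayer(LayerNorm(x)); the residual path stays an identity."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

# Usage: with a sublayer that outputs zeros, pre-norm passes x through unchanged,
# while post-norm still renormalizes it.
zero = lambda h: [0.0] * len(h)
x = [1.0, 2.0, 3.0, 6.0]
pe = sinusoidal_encoding(4, 6)
pre = pre_norm_block(x, zero)
post = post_norm_block(x, zero)
```

The identity residual path in the pre-norm variant is the usual argument for why it tends to train more stably in deep stacks, which is worth stating when comparing the two strategies.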

3) BERT vs GPT

Compare BERT and GPT in terms of:

  • Pretraining objectives
  • Architectural differences
  • Typical downstream usage
  • How these choices affect transfer learning and inference behavior/performance
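The difference in pretraining objectives can be made concrete by looking at how each model turns one token sequence into inputs and targets. The sketch below is a simplification with hypothetical helper names: the masked positions are passed in explicitly rather than sampled, and BERT's original recipe (sampling roughly 15% of tokens and sometimes substituting a random or unchanged token instead of `[MASK]`) is not reproduced.

```python
MASK = "[MASK]"

def mlm_example(tokens, masked_positions):
    """BERT-style masked LM: corrupt chosen positions and predict only those,
    using context from both the left and the right."""
    inputs = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
    labels = [t if i in masked_positions else None for i, t in enumerate(tokens)]
    return inputs, labels

def causal_lm_example(tokens):
    """GPT-style causal LM: every position predicts the next token,
    using left context only."""
    return tokens[:-1], tokens[1:]

# Usage on one toy sentence.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
mlm_inputs, mlm_labels = mlm_example(tokens, {1, 4})
clm_inputs, clm_labels = causal_lm_example(tokens)
```

Note that the masked-LM example produces a training signal only at the masked positions, while the causal example produces one at every position. The causal setup also matches how the model is used at inference time, one token at a time, which is why decoder-only models generate text directly whereas BERT-style encoders are typically fine-tuned for classification or extraction.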
