Scaled Dot-Product Self-Attention, Transformer Architecture, and BERT vs GPT
You are interviewing for a software engineer role focused on machine learning. Explain the core math and design choices behind Transformers and how they translate to practical trade-offs in transfer learning and inference.
1) Scaled Dot-Product Self-Attention
Derive and define the following:
- Queries (Q), Keys (K), Values (V), and how they are computed from the inputs
- The scaling factor and why it is needed
- Masking (padding and causal)
- The softmax over the attention logits
- Time and memory complexity, including multi-head attention and autoregressive decoding (a minimal sketch follows this list)
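For reference, here is a minimal single-head sketch in NumPy, assuming illustrative choices not fixed by the question: the name scaled_dot_product_attention, a toy sequence of 4 tokens with d_model = 8, and a boolean mask that is True where a query may attend to a key.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v);
    mask: (n_q, n_k) boolean, True where attention is allowed."""
    d_k = Q.shape[-1]
    # Scaling by sqrt(d_k) keeps the logits' variance roughly constant as d_k grows,
    # so the softmax does not saturate and gradients stay usable.
    logits = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # Padding or future positions get -inf logits, hence zero attention weight.
        logits = np.where(mask, logits, -np.inf)
    weights = softmax(logits, axis=-1)   # each row sums to 1 over the keys
    return weights @ V                   # (n_q, d_v)

# Toy usage: 4 tokens, d_model = 8, causal (lower-triangular) mask.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv         # Q, K, V are learned projections of the same inputs
causal = np.tril(np.ones((4, 4), dtype=bool))
print(scaled_dot_product_attention(Q, K, V, causal).shape)  # (4, 8)
```

The n_q x n_k logits matrix is the source of the quadratic time and memory cost. Multi-head attention splits d_model across h heads without changing that asymptotic cost, and autoregressive decoders typically cache K and V so each newly generated token costs O(n) rather than recomputing the full matrix.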
2) Transformer Architecture
Explain the encoder–decoder Transformer architecture, including:
- Encoder vs decoder stacks and their sublayers
- Residual connections and the normalization strategy (pre-norm vs post-norm)
- Positional encoding, sinusoidal and alternatives (a minimal sketch follows this list)
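As a reference point for these items, here is a minimal NumPy sketch of sinusoidal positional encoding plus one pre-norm encoder layer. It deliberately simplifies: identity Q/K/V projections, a single head, LayerNorm without learned scale/shift, and names such as pre_norm_encoder_layer are choices made here, not mandated by the architecture.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal encodings: even dims use sin, odd dims use cos,
    with geometrically spaced wavelengths over positions."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean / unit variance (no learned gain/bias here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x):
    # Single-head self-attention with identity projections, just to show the data flow.
    d = x.shape[-1]
    logits = x @ x.T / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def pre_norm_encoder_layer(x, W1, b1, W2, b2):
    """Pre-norm: LayerNorm is applied *before* each sublayer, then the residual is added.
    (Post-norm would instead apply LayerNorm after the residual addition.)"""
    x = x + self_attention(layer_norm(x))             # self-attention sublayer + residual
    h = np.maximum(0.0, layer_norm(x) @ W1 + b1)      # position-wise FFN with ReLU
    return x + h @ W2 + b2                            # FFN sublayer + residual

# Toy usage: 6 tokens, d_model = 16, FFN inner dim = 32.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))
x = tokens + sinusoidal_positions(6, 16)              # inject position information
W1, b1 = rng.normal(size=(16, 32)) * 0.1, np.zeros(32)
W2, b2 = rng.normal(size=(32, 16)) * 0.1, np.zeros(16)
print(pre_norm_encoder_layer(x, W1, b1, W2, b2).shape)  # (6, 16)
```

Pre-norm keeps the residual path close to an identity map, which tends to make deep stacks easier to train; the original Transformer used post-norm (LayerNorm after the residual addition).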
3) BERT vs GPT
Compare BERT and GPT in terms of:
- Pretraining objectives (a contrasting sketch follows this list)
- Architectural differences
- Typical downstream usage
- How these choices affect transfer learning and inference behavior/performance
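To make the objective difference concrete, here is a small illustrative sketch of how training targets are built for a BERT-style masked language model versus a GPT-style causal language model. The token IDs, the masking rate shown as a default, and names like MASK_ID and mlm_example are hypothetical and not tied to any particular tokenizer or library.

```python
import numpy as np

MASK_ID = 103        # hypothetical [MASK] token id
IGNORE = -100        # positions with this label contribute no loss

def mlm_example(token_ids, mask_prob=0.15, rng=None):
    """BERT-style: corrupt a random subset of positions and predict the originals.
    The model attends bidirectionally, since nothing about the future is hidden."""
    if rng is None:
        rng = np.random.default_rng(0)
    ids = np.array(token_ids)
    labels = np.full_like(ids, IGNORE)
    chosen = rng.random(ids.shape) < mask_prob
    labels[chosen] = ids[chosen]       # loss is computed only at the corrupted positions
    ids[chosen] = MASK_ID              # (real BERT also sometimes keeps or randomizes them)
    return ids, labels

def clm_example(token_ids):
    """GPT-style: predict token t+1 from tokens <= t, so inputs and labels are the same
    sequence shifted by one, and a causal mask hides the future at every position."""
    ids = np.array(token_ids)
    return ids[:-1], ids[1:]

tokens = [5, 17, 42, 8, 99, 23]
print(mlm_example(tokens))   # corrupted inputs + sparse labels at masked positions
print(clm_example(tokens))   # inputs [5,17,42,8,99], labels [17,42,8,99,23]
```

The downstream consequence: GPT's causal setup supports token-by-token generation (typically with a K/V cache), while BERT's bidirectional encoder produces contextual representations for a whole sequence in one pass, which suits classification, tagging, and span-extraction fine-tuning rather than open-ended generation.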