Scaled Dot-Product Self-Attention, Transformer Architecture, and BERT vs GPT
You are interviewing for a software engineer role focused on machine learning. Explain the core math and design choices behind Transformers and how they translate to practical trade-offs in transfer learning and inference.
1) Scaled Dot-Product Self-Attention
Derive and define the following:
- Queries (Q), Keys (K), Values (V), and how they are computed from the inputs
- The scaling factor and why it is needed
- Masking (padding and causal)
- The softmax over the attention logits
- Time and memory complexity, including multi-head attention and autoregressive decoding (a minimal sketch follows this list)
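For reference, here is a minimal single-head sketch in NumPy, assuming illustrative choices not fixed by the question: the name scaled_dot_product_attention, a toy sequence of 4 tokens with d_model = 8, and a boolean mask that is True where a query may attend to a key.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v);
    mask: (n_q, n_k) boolean, True where attention is allowed."""
    d_k = Q.shape[-1]
    # Scaling by sqrt(d_k) keeps the logits' variance roughly constant as d_k grows,
    # so the softmax does not saturate and gradients stay usable.
    logits = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # Padding or future positions get -inf logits, hence zero attention weight.
        logits = np.where(mask, logits, -np.inf)
    weights = softmax(logits, axis=-1)   # each row sums to 1 over the keys
    return weights @ V                   # (n_q, d_v)

# Toy usage: 4 tokens, d_model = 8, causal (lower-triangular) mask.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv         # Q, K, V are learned projections of the same inputs
causal = np.tril(np.ones((4, 4), dtype=bool))
print(scaled_dot_product_attention(Q, K, V, causal).shape)  # (4, 8)
```

The n_q x n_k logits matrix is the source of the quadratic time and memory cost. Multi-head attention splits d_model across h heads without changing that asymptotic cost, and autoregressive decoders typically cache K and V so each newly generated token costs O(n) rather than recomputing the full matrix.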
2) Transformer Architecture
Explain the encoder–decoder Transformer architecture, including:
- Encoder vs decoder stacks and their sublayers
- Residual connections and the normalization strategy (pre-norm vs post-norm)
- Positional encoding, sinusoidal and alternatives (a minimal sketch follows this list)
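As a reference point for these items, here is a minimal NumPy sketch of sinusoidal positional encoding plus one pre-norm encoder layer. It deliberately simplifies: identity Q/K/V projections, a single head, LayerNorm without learned scale/shift, and names such as pre_norm_encoder_layer are choices made here, not mandated by the architecture.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal encodings: even dims use sin, odd dims use cos,
    with geometrically spaced wavelengths over positions."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean / unit variance (no learned gain/bias here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x):
    # Single-head self-attention with identity projections, just to show the data flow.
    d = x.shape[-1]
    logits = x @ x.T / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def pre_norm_encoder_layer(x, W1, b1, W2, b2):
    """Pre-norm: LayerNorm is applied *before* each sublayer, then the residual is added.
    (Post-norm would instead apply LayerNorm after the residual addition.)"""
    x = x + self_attention(layer_norm(x))             # self-attention sublayer + residual
    h = np.maximum(0.0, layer_norm(x) @ W1 + b1)      # position-wise FFN with ReLU
    return x + h @ W2 + b2                            # FFN sublayer + residual

# Toy usage: 6 tokens, d_model = 16, FFN inner dim = 32.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))
x = tokens + sinusoidal_positions(6, 16)              # inject position information
W1, b1 = rng.normal(size=(16, 32)) * 0.1, np.zeros(32)
W2, b2 = rng.normal(size=(32, 16)) * 0.1, np.zeros(16)
print(pre_norm_encoder_layer(x, W1, b1, W2, b2).shape)  # (6, 16)
```

Pre-norm keeps the residual path close to an identity map, which tends to make deep stacks easier to train; the original Transformer used post-norm (LayerNorm after the residual addition).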
3) BERT vs GPT
Compare BERT and GPT in terms of:
- Pretraining objectives (a contrasting sketch follows this list)
- Architectural differences
- Typical downstream usage
- How these choices affect transfer learning and inference behavior/performance
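To make the objective difference concrete, here is a small illustrative sketch of how training targets are built for a BERT-style masked language model versus a GPT-style causal language model. The token IDs, the masking rate shown as a default, and names like MASK_ID and mlm_example are hypothetical and not tied to any particular tokenizer or library.

```python
import numpy as np

MASK_ID = 103        # hypothetical [MASK] token id
IGNORE = -100        # positions with this label contribute no loss

def mlm_example(token_ids, mask_prob=0.15, rng=None):
    """BERT-style: corrupt a random subset of positions and predict the originals.
    The model attends bidirectionally, since nothing about the future is hidden."""
    if rng is None:
        rng = np.random.default_rng(0)
    ids = np.array(token_ids)
    labels = np.full_like(ids, IGNORE)
    chosen = rng.random(ids.shape) < mask_prob
    labels[chosen] = ids[chosen]       # loss is computed only at the corrupted positions
    ids[chosen] = MASK_ID              # (real BERT also sometimes keeps or randomizes them)
    return ids, labels

def clm_example(token_ids):
    """GPT-style: predict token t+1 from tokens <= t, so inputs and labels are the same
    sequence shifted by one, and a causal mask hides the future at every position."""
    ids = np.array(token_ids)
    return ids[:-1], ids[1:]

tokens = [5, 17, 42, 8, 99, 23]
print(mlm_example(tokens))   # corrupted inputs + sparse labels at masked positions
print(clm_example(tokens))   # inputs [5,17,42,8,99], labels [17,42,8,99,23]
```

The downstream consequence: GPT's causal setup supports token-by-token generation (typically with a K/V cache), while BERT's bidirectional encoder produces contextual representations for a whole sequence in one pass, which suits classification, tagging, and span-extraction fine-tuning rather than open-ended generation.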