This question evaluates understanding of scaled dot-product self-attention, the Transformer encoder–decoder architecture (including positional encodings, residual connections, and layer normalization), and the differences between BERT and GPT in pretraining and usage, as well as competence in attention math, masking, time/memory complexity, and transfer-learning/inference trade-offs.
You are interviewing for a software engineer role focused on machine learning. Explain the core math and design choices behind Transformers and how they translate to practical trade-offs in transfer learning and inference.
Derive and define the following (a minimal numerical sketch follows this list):
- The scaled dot-product attention formula, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, and why the scores are divided by sqrt(d_k).
- How causal (look-ahead) masking restricts each position to attend only to earlier positions.
- The time and memory complexity of self-attention as a function of sequence length n and model dimension d.
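As a point of reference for the derivation, here is a minimal NumPy sketch of scaled dot-product attention with an optional causal mask. The single-head, unbatched shapes and the `causal` flag are simplifying assumptions for illustration, not a reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Sketch of Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Assumed shapes (single head, no batch): Q is (n_q, d_k),
    K is (n_k, d_k), V is (n_k, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k) similarity logits
    if causal:
        # Mask out future positions so token i cannot attend to j > i.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    # Row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (n_q, d_v)

# Toy usage: self-attention over 4 tokens with d_k = d_v = 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x, causal=True)
print(out.shape)  # (4, 8)
```

Note that the scores matrix is (n, n), which is where the quadratic time and memory cost in sequence length comes from.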
Explain the encoder–decoder Transformer architecture, including:
- How positional encodings inject order information into otherwise permutation-invariant attention layers (a sketch of the sinusoidal variant follows this list).
- The role of residual connections and layer normalization around each attention and feed-forward sublayer.
- How the decoder combines masked self-attention with cross-attention over the encoder outputs.
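One common choice of positional encoding is the fixed sinusoidal scheme from the original Transformer paper. The sketch below assumes an even d_model and is only meant to show how order information is produced and added to token embeddings before the first encoder layer.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...).

    Assumes d_model is even; returns an array of shape (seq_len, d_model)
    that is added elementwise to the token embeddings.
    """
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimensions
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_positional_encoding(50, 16).shape)  # (50, 16)
```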
Compare BERT and GPT in terms of:
- Pretraining objective: masked language modeling with bidirectional context versus autoregressive next-token prediction (a toy illustration follows this list).
- Architecture: encoder-only versus decoder-only stacks.
- Downstream usage: fine-tuning for classification and extraction tasks versus prompting and generation, and the resulting transfer-learning and inference trade-offs.
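To make the objective difference concrete, the toy sketch below builds training inputs and targets both ways. The token ids, the MASK_ID value, and masking a single position (BERT masks roughly 15% of tokens) are made-up simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = np.array([12, 7, 99, 4, 31, 8])   # a toy tokenized sentence
MASK_ID = 103                               # hypothetical [MASK] token id

# BERT-style masked language modeling: hide some tokens (here just one) and
# predict them from bidirectional context (encoder-only, no causal mask).
mlm_input = tokens.copy()
masked_positions = rng.choice(len(tokens), size=1, replace=False)
mlm_targets = tokens[masked_positions]
mlm_input[masked_positions] = MASK_ID

# GPT-style autoregressive modeling: predict token t+1 from tokens up to t
# (decoder-only, causal mask); inputs and targets are shifted by one.
ar_input, ar_targets = tokens[:-1], tokens[1:]

print(mlm_input, mlm_targets)
print(ar_input, ar_targets)
```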