Explain attention and Transformers

This question tests your understanding of scaled dot-product self-attention; the Transformer encoder–decoder architecture, including positional encodings, residual connections, and normalization; and the differences between BERT and GPT in pretraining and usage. It also covers attention math, masking, time and memory complexity, and the trade-offs in transfer learning and inference.

Question

Scaled Dot-Product Self-Attention, Transformer Architecture, and BERT vs GPT

You are interviewing for a software engineer role focused on machine learning. Explain the core math and design choices behind Transformers and how they translate to practical trade-offs in transfer learning and inference.

1) Scaled Dot-Product Self-Attention

Derive and define the following:

Queries (Q), Keys (K), Values (V) and how they are computed from inputs
The scaling factor and why it is needed
Masking (padding and causal)
Softmax over attention logits
Time and memory complexity (including multi-head and autoregressive decoding)
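
As a reference point for the items above, here is a minimal single-head sketch in plain Python. It assumes Q, K, and V have already been produced by learned linear projections of the inputs, and it computes softmax(QK^T / sqrt(d_k)) V with optional causal and padding masks. The names `attention`, `softmax`, and `key_padding` are illustrative and not taken from any library.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability; masked logits are -inf
    # and therefore receive weight exactly 0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False, key_padding=None):
    """Single-head scaled dot-product attention on lists of lists.

    Q is n x d_k, K is m x d_k, V is m x d_v.
    key_padding[j] is True when key position j is padding.
    Assumes every query keeps at least one unmasked key.
    """
    d_k = len(Q[0])
    scale = 1.0 / math.sqrt(d_k)  # keeps logit variance roughly independent of d_k
    out = []
    for i, q in enumerate(Q):
        logits = []
        for j, k in enumerate(K):
            masked = (causal and j > i) or (key_padding is not None and key_padding[j])
            score = sum(a * b for a, b in zip(q, k)) * scale
            logits.append(float("-inf") if masked else score)
        weights = softmax(logits)  # one distribution over keys per query
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

X = [[1.0, 0.0], [0.0, 1.0]]
result = attention(X, X, X, causal=True)
```

The nested loops make the cost visible: n x m scores are computed, each from a d_k-dimensional dot product, so time grows as O(n m d_k) and the score matrix takes O(n m) memory. For self-attention n equals m, which gives the familiar quadratic dependence on sequence length.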

2) Transformer Architecture

Explain the encoder–decoder Transformer architecture, including:

Encoder vs decoder stacks and their sublayers
Residual connections and the normalization strategy (pre-norm vs post-norm)
Positional encoding (sinusoidal and alternatives)
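
For the sinusoidal option, a small sketch under the usual formulation: even dimensions hold sin(pos / base^(2i/d_model)) and odd dimensions hold the matching cosine, with base 10000. The function name `sinusoidal_positions` and its parameters are hypothetical, chosen only for this illustration.

```python
import math

def sinusoidal_positions(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of fixed position encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency.
            angle = pos / (base ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positions(4, 6)
```

Each row is added to the token embedding at that position. Because the table is a fixed function of position, it needs no learned parameters and can be evaluated for positions not seen in training.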

3) BERT vs GPT

Compare BERT and GPT in terms of:

Pretraining objectives
Architectural differences
Typical downstream usage
How these choices affect transfer learning and inference behavior/performance
