How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

What difficulty level is this interview question?

This is a hard-difficulty Machine Learning question, commonly asked during onsite rounds at Adobe.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Adobe during technical interviews.

Explain Transformer Attention Fundamentals

This question evaluates understanding of Transformer attention mechanics: encoder/decoder distinctions, bidirectional versus causal self-attention, cross-attention and key-value caching, multi-head attention internals, computational complexity, mixed-precision training trade-offs (FP16/BF16/FP32), and large language model training, alignment, and evaluation workflows. It assesses competencies in model architecture, numerical precision, memory/compute optimization, and alignment methodology. It is commonly asked in Machine Learning interviews to probe reasoning about architectural and implementation trade-offs, performance and memory complexity, and evaluation/alignment risks. The category is Machine Learning, and the level of abstraction spans both conceptual understanding and practical application.

In a machine learning fundamentals interview, explain the core mechanics of Transformer models and modern large language model training. Address the following topics:

Compare encoder-only, decoder-only, and encoder-decoder Transformers.
Explain bidirectional self-attention versus causal masked self-attention, and relate them to semantic understanding and autoregressive generation.
Explain how encoder-decoder cross-attention works.
Explain why encoder models generally do not need a key-value cache.
Walk through the full operation order inside multi-head attention, including Q/K/V projections, tensor reshaping, attention score computation, masking, softmax, value aggregation, head concatenation, and output projection.
Explain why multi-head attention usually has four linear projections.
Analyze the computational complexity of self-attention and explain why it becomes quadratic in sequence length.
Explain how a key-value cache works during autoregressive inference, why K and V are cached but Q is not, why it helps decoder-only models, and what memory tradeoffs it introduces.
Compare FP16, BF16, and FP32 for mixed precision training, including tensor core acceleration, loss scaling, FP32 master weights, and why BF16 is often preferred for large language models.
Explain common large language model training, alignment, and evaluation workflows, including direct preference optimization, teacher-student distillation, offline teacher generation with online student serving, LLM-as-a-judge evaluation, structured rubrics, offline versus online metrics, reward hacking, evaluator bias, alignment drift, and hallucination regression.
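
The operation order named in the multi-head attention topic above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name multi_head_attention, the argument names, and the choice of a single unbatched sequence with square (d_model, d_model) weights and no biases are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads, causal=False):
    """x: (seq_len, d_model); each weight: (d_model, d_model). Hypothetical helper."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # 1. Q/K/V projections (three of the four linear projections).
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # 2. Reshape (seq_len, d_model) -> (num_heads, seq_len, d_head).
    def split_heads(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    # 3. Scaled dot-product scores: one (seq_len, seq_len) matrix per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)

    # 4. Masking: a causal mask hides positions to the right of each query.
    if causal:
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)

    # 5. Softmax over the key axis.
    weights = softmax(scores, axis=-1)

    # 6. Value aggregation: weighted sum of value vectors per head.
    context = weights @ v  # (num_heads, seq_len, d_head)

    # 7. Head concatenation back to (seq_len, d_model).
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)

    # 8. Output projection (the fourth linear projection).
    return merged @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads, causal=True)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5)
```

The weights array has shape (num_heads, seq_len, seq_len), which is where the quadratic cost in sequence length comes from. With causal=True, the output at a position does not depend on later tokens, which is the property a key-value cache relies on during autoregressive decoding.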
