What difficulty level is this interview question?

This is an easy-difficulty Machine Learning question, commonly asked during Technical Screen rounds at Apple.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Apple during technical interviews.

Implement Masked Multi-Head Self-Attention

This question evaluates implementation and conceptual understanding of masked multi-head self-attention, covering scaled dot-product attention, separate linear projections for queries/keys/values, head-wise tensor reshaping, and the construction and application of padding and causal masks.

Implement the core self-attention module used inside a Transformer encoder, from scratch, in a deep-learning framework of your choice (PyTorch-style pseudocode is fine).

Given an input tensor X of shape (batch_size, sequence_length, d_model), first implement scaled dot-product self-attention, then extend it to multi-head self-attention with num_heads heads. A complete answer must:

Learn separate linear projections for queries, keys, and values.
Split the projected tensors into num_heads independent heads.
Compute attention scores as $\frac{QK^\top}{\sqrt{d_k}}$.
Apply an optional attention mask before the softmax so masked positions receive zero weight.
Apply softmax over the key dimension.
Multiply the attention weights by the values.
Concatenate all heads and apply a final output projection.

State the expected tensor shape at each major step, and explain how you would construct and apply both a padding mask and a causal (look-ahead) mask.
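As a starting point for the first part of the task, here is a minimal single-head sketch of scaled dot-product attention in PyTorch. The function name scaled_dot_product_attention, the argument name keep_mask, and the convention that True means "may attend" are assumptions made for illustration; the prompt leaves the mask convention open.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, keep_mask=None):
    """Single-head attention sketch.

    q, k, v:   (..., T, d_k)
    keep_mask: optional bool tensor broadcastable to (..., T, T);
               True = may attend (assumed convention).
    """
    d_k = q.size(-1)
    # Scores: (..., T, d_k) @ (..., d_k, T) -> (..., T, T)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if keep_mask is not None:
        # Mask BEFORE the softmax so masked keys get zero weight.
        scores = scores.masked_fill(~keep_mask, float("-inf"))
    # Normalize over the key dimension (last axis): (..., T, T)
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of values: (..., T, T) @ (..., T, d_k) -> (..., T, d_k)
    return weights @ v, weights
```

This version fills with -inf for clarity, which is only safe when every query row has at least one unmasked key; a causal mask satisfies that, a padding mask may not.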

Constraints & Assumptions

d_model is divisible by num_heads; define $d_k = d_{\text{model}} / \text{num\_heads}$.
Self-attention: queries, keys, and values are all derived from the same input X (so the query, key, and value sequence lengths are equal to sequence_length).
The mask is optional. When supplied it may be a padding mask of shape (batch_size, sequence_length), a causal mask of shape (sequence_length, sequence_length), or both combined.
Typical scale to keep in mind: d_model in the hundreds-to-thousands (e.g. 512), num_heads 8–16, sequence lengths up to a few thousand. Do not assume a fixed value—write the code parametrically.
Numerical stability matters: a masked-out logit must be driven low enough that softmax assigns it ~0 weight.
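The mask shapes listed above can be combined by broadcasting. The sketch below builds a padding mask from per-example lengths and optionally intersects it with a lower-triangular causal mask. The helper name build_keep_mask, the lengths argument, and the True = "may attend" convention are hypothetical choices, not part of the prompt.

```python
import torch

def build_keep_mask(lengths, seq_len, causal=False):
    """Build a boolean attention mask (True = may attend, assumed convention).

    lengths: (B,) integer tensor with the number of real tokens per example.
    Returns a bool tensor broadcastable to (B, H, T, T):
      (B, 1, 1, T) for padding only, (B, 1, T, T) when causal=True.
    """
    positions = torch.arange(seq_len, device=lengths.device)      # (T,)
    padding = positions[None, :] < lengths[:, None]               # (B, T)
    # Padding masks out KEYS, so it lives on the last axis.
    keep = padding[:, None, None, :]                              # (B, 1, 1, T)
    if causal:
        # Query i may attend to keys j <= i.
        tri = torch.tril(torch.ones(seq_len, seq_len,
                                    dtype=torch.bool,
                                    device=lengths.device))       # (T, T)
        keep = keep & tri                                         # (B, 1, T, T)
    return keep
```

The singleton head axis lets the same mask broadcast across all heads without materializing H copies.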

Clarifying Questions to Ask

Is this self-attention (Q, K, V from the same X) or cross-attention (K, V from a different source)? This affects the projection inputs and the mask shapes.
What does the mask convention mean: does True / 1 indicate keep or mask out? Is it a 0/1 tensor or already an additive -inf bias?
Should the module include dropout on the attention weights, and/or a scaling/bias term in the projections?
Do we need to support variable-length sequences in a batch (i.e. a real padding mask), or are all sequences the same length?
Are we optimizing for clarity/correctness, or should I also discuss a fused/flash-attention-style kernel and memory complexity?
Should the output keep shape (B, T, d_model) to slot directly into a residual + LayerNorm block?
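The mask-convention question matters because the same tensor can mean opposite things. One way to make the answer explicit in code is to normalize whatever arrives into an additive bias that is simply added to the logits. The helper below is a hypothetical sketch; the name to_additive_bias and the true_means_keep flag are assumptions for illustration.

```python
import torch

def to_additive_bias(mask, true_means_keep=True, dtype=torch.float32):
    """Convert a 0/1 or boolean mask into an additive bias for the logits.

    Kept positions get 0; masked positions get the most negative finite
    value of `dtype` (a finite stand-in for -inf).
    """
    if mask.dtype != torch.bool:
        mask = mask != 0                      # treat any nonzero entry as True
    keep = mask if true_means_keep else ~mask
    bias = torch.zeros(keep.shape, dtype=dtype, device=keep.device)
    return bias.masked_fill(~keep, torch.finfo(dtype).min)
```

Usage would look like scores = scores + to_additive_bias(mask)[:, None, None, :] for a (B, T) padding mask, with the bias dtype matched to the logits.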

What a Strong Answer Covers

Correct math: the $1/\sqrt{d_k}$ scaling and why it exists (controlling logit variance so softmax gradients don't vanish).
Shape bookkeeping: an explicit shape at every step: projection, head split, scores, masked scores, softmax, context, head merge, output projection.
Single-projection-then-split trick rather than per-head layers (efficiency and parameter parity).
Mask correctness: pre-softmax application, broadcasting a (B, T) padding mask, building a triangular causal mask, and combining the two.
Numerical care: large-negative fill instead of post-softmax zeroing; awareness of the fully-masked-row / all -inf degenerate case.
Framework hygiene: .contiguous() before view, the assertion d_model % num_heads == 0, output shape matching input.
Complexity awareness: the attention-matrix cost is $O(T^2 \cdot d_{\text{model}})$ time and $O(H \cdot T^2)$ memory, while the Q/K/V/output projections add $O(T \cdot d_{\text{model}}^2)$; knowing which term dominates in which $T$-vs-$d_{\text{model}}$ regime, and where this becomes the bottleneck.
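The points above can be pulled together in one module. What follows is a sketch under stated assumptions, not a reference solution: the class name MultiHeadSelfAttention, the keep_mask argument (boolean, True = may attend, broadcastable to (B, H, T, T)), and the optional dropout are illustrative choices. Shapes are annotated with B = batch_size, T = sequence_length, H = num_heads.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.0):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # One full-width projection each for Q, K, V; heads are split afterwards.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def _split_heads(self, x):
        B, T, _ = x.shape
        # (B, T, d_model) -> (B, T, H, d_k) -> (B, H, T, d_k)
        return x.view(B, T, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, x, keep_mask=None):
        B, T, d_model = x.shape                                  # (B, T, d_model)
        q = self._split_heads(self.q_proj(x))                    # (B, H, T, d_k)
        k = self._split_heads(self.k_proj(x))                    # (B, H, T, d_k)
        v = self._split_heads(self.v_proj(x))                    # (B, H, T, d_k)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # (B, H, T, T)
        if keep_mask is not None:
            # Large-negative fill before softmax; finite, so no NaN rows.
            neg = torch.finfo(scores.dtype).min
            scores = scores.masked_fill(~keep_mask, neg)         # (B, H, T, T)
        weights = torch.softmax(scores, dim=-1)                  # (B, H, T, T)
        weights = self.dropout(weights)
        context = weights @ v                                    # (B, H, T, d_k)
        # transpose makes the tensor non-contiguous, so call
        # .contiguous() before view when merging heads.
        context = context.transpose(1, 2).contiguous().view(B, T, d_model)
        return self.out_proj(context)                            # (B, T, d_model)
```

With a finite fill value, a fully masked row produces uniform weights rather than NaN; whether that is acceptable depends on whether the padded query positions are ignored downstream.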

Follow-up Questions

How does this change for cross-attention (e.g. in a decoder), where keys/values come from the encoder output of a different length?
Attention is $O(T^2)$ in sequence length. What approaches reduce this (FlashAttention's tiling/recomputation, or linear/sparse attention), and what do they trade off?
Where do rotary or other relative position embeddings (RoPE, ALiBi) plug into this module, and why might they be preferred over absolute positional encodings added to X?
An entire row of the attention logits is masked out (a fully-padded query position). What happens to the softmax, and how would you prevent NaNs from propagating?
Why is multi-head attention empirically better than a single head of the same total width, given the parameter count is roughly equal?
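For the fully-masked-row follow-up, one possible answer is sketched below. A softmax over a row of all -inf evaluates to NaN, so this hypothetical helper (safe_masked_softmax is an assumed name, and True = may attend is an assumed convention) gives such rows finite placeholder logits before the softmax and then zeroes their weights afterwards.

```python
import torch

def safe_masked_softmax(scores, keep_mask):
    """Masked softmax over the last axis that returns all-zero weights,
    instead of NaN, for rows where every key is masked out.

    scores:    (..., T_q, T_k)
    keep_mask: bool, broadcastable to scores; True = may attend (assumed).
    """
    masked = scores.masked_fill(~keep_mask, float("-inf"))
    # Rows with at least one valid key: (..., T_q, 1) after broadcasting.
    row_has_key = keep_mask.any(dim=-1, keepdim=True)
    # Give fully masked rows finite logits so the softmax never sees all -inf.
    masked = masked.masked_fill(~row_has_key, 0.0)
    weights = torch.softmax(masked, dim=-1)
    # Those rows carry no information, so zero their weights explicitly.
    return weights.masked_fill(~row_has_key, 0.0)
```

A simpler alternative is to fill with a large finite negative value, which turns a fully masked row into uniform attention; that avoids NaNs but still mixes values into a position that should be ignored.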
