What difficulty level is this interview question?

This is a hard difficulty Machine Learning question, commonly asked during Technical Screen rounds at Datadog.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Datadog during technical interviews.

Implement Grouped-Query Attention (GQA) | Datadog Interview Question

This question evaluates a candidate's understanding of transformer attention mechanisms, specifically how grouped-query attention balances the memory efficiency of multi-query attention with the quality of full multi-head attention. It tests practical implementation skill with tensor reshaping, head grouping, and causal masking, a common way to probe machine learning engineering depth in system-level model design interviews.

Implement Grouped-Query Attention (GQA)

Modern decoder-only transformers spend most of their autoregressive-decoding time and memory moving the key/value (KV) cache in and out of GPU memory. Multi-Head Attention (MHA) keeps one K and one V per query head, which is expensive; Multi-Query Attention (MQA) shares a single K/V across all heads, which is cheap but can hurt quality. Grouped-Query Attention (GQA) is the middle ground: the query heads are partitioned into $G$ groups, and all query heads within a group share one K/V head.
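The grouping described above can be made concrete with a small index calculation. This is a minimal sketch that assumes query heads are assigned to groups in contiguous blocks (a common convention, but an assumption here); the function name `kv_head_for_query_head` is hypothetical.

```python
def kv_head_for_query_head(q_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Return the index of the KV head shared by the given query head.

    Assumes contiguous grouping: the first H // G query heads use KV head 0, and so on.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


# H = 8 query heads, G = 2 KV heads: heads 0-3 share KV head 0, heads 4-7 share KV head 1.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With `num_kv_heads == num_query_heads` the mapping is the identity (MHA), and with `num_kv_heads == 1` every query head maps to KV head 0 (MQA).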

Implement the forward pass of a GQA module from scratch using only basic tensor ops (linear layers + matmul + softmax; no high-level attention helper). Given:

input hidden states $x$ of shape $(B, T, d_{\text{model}})$,
num_query_heads $H$ and num_kv_heads $G$, with $H$ divisible by $G$,
per-head dimension $d_{\text{head}} = d_{\text{model}} / H$,

your module should:

Project $x$ to queries $Q$ ($H$ heads) and keys/values $K, V$ ($G$ heads each).
Reshape into heads and share each KV head across $H/G$ query heads.
Compute causal scaled dot-product attention.
Concatenate heads and apply the output projection back to $d_{\text{model}}$.
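The four steps above can be sketched as follows. This is one possible forward pass, not a reference solution: it uses NumPy in place of a deep-learning framework, represents each linear layer as a plain bias-free weight matrix (`w_q`, `w_k`, `w_v`, `w_o` are assumed names), and shares KV heads by repeating them across contiguous query heads.

```python
import numpy as np


def softmax(scores, axis=-1):
    # Subtract the row max for numerical stability; exp(-inf) evaluates to 0.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def gqa_forward(x, w_q, w_k, w_v, w_o, num_query_heads, num_kv_heads):
    B, T, d_model = x.shape
    H, G = num_query_heads, num_kv_heads
    assert d_model % H == 0 and H % G == 0
    d_head = d_model // H

    # 1. Project: Q gets H heads, K and V get only G heads each.
    q = (x @ w_q).reshape(B, T, H, d_head).transpose(0, 2, 1, 3)  # (B, H, T, d_head)
    k = (x @ w_k).reshape(B, T, G, d_head).transpose(0, 2, 1, 3)  # (B, G, T, d_head)
    v = (x @ w_v).reshape(B, T, G, d_head).transpose(0, 2, 1, 3)  # (B, G, T, d_head)

    # 2. Share each KV head across H // G consecutive query heads.
    k = np.repeat(k, H // G, axis=1)  # (B, H, T, d_head)
    v = np.repeat(v, H // G, axis=1)  # (B, H, T, d_head)

    # 3. Causal scaled dot-product attention.
    scores = (q @ k.transpose(0, 1, 3, 2)) * d_head ** -0.5  # (B, H, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v  # (B, H, T, d_head)

    # 4. Concatenate heads and project back to d_model.
    out = out.transpose(0, 2, 1, 3).reshape(B, T, d_model)
    return out @ w_o


# Usage with random weights; shapes follow the problem statement.
rng = np.random.default_rng(0)
B, T, d_model, H, G = 2, 5, 16, 8, 2
d_head = d_model // H
x = rng.standard_normal((B, T, d_model)).astype(np.float32)
w_q = rng.standard_normal((d_model, H * d_head)).astype(np.float32)
w_k = rng.standard_normal((d_model, G * d_head)).astype(np.float32)
w_v = rng.standard_normal((d_model, G * d_head)).astype(np.float32)
w_o = rng.standard_normal((H * d_head, d_model)).astype(np.float32)
y = gqa_forward(x, w_q, w_k, w_v, w_o, H, G)
print(y.shape)  # (2, 5, 16)
```

Repeating K and V materializes $H$ heads and is the simplest thing to write in an interview. An alternative is to reshape $Q$ to $(B, G, H/G, T, d_{\text{head}})$ and broadcast against the $G$ KV heads, which avoids the copy.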

Constraints & Assumptions

Shapes: $x$ is $(B, T, d_{\text{model}})$; the output is $(B, T, d_{\text{model}})$.
$H \bmod G = 0$ (e.g. $H=8,\ G=2$ → each KV head is shared by 4 query heads). $G = H$ is MHA; $G = 1$ is MQA.
$d_{\text{model}} \bmod H = 0$; $d_{\text{head}} = d_{\text{model}} / H$.
Scale scores by $1/\sqrt{d_{\text{head}}}$.
Decoder self-attention: apply a causal mask so position $t$ attends only to positions $\le t$.
Single-precision floats; the mask fill should use a large negative value (or $-\infty$) compatible with softmax.
Dropout and rotary embeddings (RoPE) may be ignored in the core implementation (mention them as extensions).
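The causal-mask and fill-value constraints can be checked in isolation. The sketch below assumes NumPy and uses all-zero scores so the masking effect is easy to read; it fills future positions with the most negative finite float32 rather than $-\infty$, which is one of the two options the constraint allows.

```python
import numpy as np

T = 4
scores = np.zeros((T, T), dtype=np.float32)  # uniform scores make the effect easy to see

# True strictly above the diagonal marks future positions (column j > row t).
future = np.triu(np.ones((T, T), dtype=bool), k=1)

# A large finite negative behaves like -inf here, and it avoids NaNs if a row
# were ever fully masked (for example once a padding mask is added).
fill = np.finfo(scores.dtype).min
masked = np.where(future, fill, scores)

# Numerically stable softmax over the key axis.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Row t is uniform over positions 0..t and exactly zero on every future position.
print(weights.round(2))
```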

Clarifying Questions to Ask

Is this causal self-attention in a decoder, or cross-attention? I will assume causal self-attention.
Should I implement only the full-sequence forward (training / prefill), or also the incremental single-token decode with a KV cache?
Are positional encodings (e.g. RoPE) applied to $Q/K$ inside the module, or handled outside?
Besides the causal mask, is there a padding mask for variable-length sequences in the batch?
Do the Q/K/V/O linear projections include bias terms?

Follow-up Questions

During autoregressive decoding with a KV cache, how much memory does GQA save versus MHA, and why is memory bandwidth (not FLOPs) the decode-time bottleneck?
How would you "uptrain" an existing MHA checkpoint into GQA, and how do you initialize the $G$ KV heads from the original $H$?
Where do rotary position embeddings (RoPE) get applied, and does GQA change that?
Implement the incremental decode step: given cached $K, V$ of length $t$ and one new token, produce the next output and update the cache. Which shapes change?
How does FlashAttention interact with GQA, and what changes about the memory-access pattern?
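The memory part of the first follow-up reduces to arithmetic: the KV cache grows linearly with the number of KV heads, so GQA shrinks it by a factor of $H/G$ relative to MHA. The sketch below illustrates this with a hypothetical model configuration (32 layers, 32 query heads, $d_{\text{head}} = 128$, 4 bytes per element for single precision); none of these numbers come from the question itself.

```python
def kv_cache_bytes(batch, seq_len, num_layers, num_kv_heads, d_head, bytes_per_elem=4):
    """Bytes needed to cache K and V for every layer (the factor 2 is one K plus one V)."""
    return 2 * batch * seq_len * num_layers * num_kv_heads * d_head * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
cfg = dict(batch=1, seq_len=4096, num_layers=32, d_head=128)
mha = kv_cache_bytes(num_kv_heads=32, **cfg)  # G = H = 32
gqa = kv_cache_bytes(num_kv_heads=8, **cfg)   # G = 8
print(mha / 2**30, gqa / 2**30, mha // gqa)  # 4.0 1.0 4
```

Under these assumptions the cache drops from 4 GiB to 1 GiB per sequence, and the number of cached bytes read at each decode step falls by the same factor.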
