What difficulty level is this interview question?

This is a medium-difficulty Machine Learning question, commonly asked during onsite rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at OpenAI during technical interviews.

Debug Transformer and Add KV Cache | OpenAI Interview Question

This question evaluates debugging and implementation skills for transformer-based autoregressive language models, focusing on attention mechanics, positional embeddings, causal masking, and integrating a key-value (KV) cache for incremental decoding.

You are given a small decoder-only transformer (GPT-style) implemented in PyTorch for autoregressive (next-token) language modeling. The starter code includes token/positional embeddings, a stack of masked self-attention blocks, an LM head, a training loop, and a naive generate function. A separate KVCache class for storing attention keys and values is also provided.

This is a two-part hands-on coding exercise: first make training correct, then add an inference-time KV cache and prove it is numerically equivalent to the uncached path.

Constraints & Assumptions

Decoder-only (GPT-style) architecture: causal multi-head self-attention + MLP blocks with residual connections and LayerNorm, learned (not sinusoidal) token and positional embeddings, and a final vocabulary head.
Training objective is next-token prediction (causal LM) with cross-entropy loss.
You may run small training/eval loops to inspect behavior; assume CPU or a single GPU and a tiny toy dataset (e.g. character-level) so you can iterate in seconds.
The four Part 1 bugs are real correctness bugs in the given code — not architectural choices. After fixing, the model must be able to overfit a tiny batch.
For Part 2, the goal is an optimization that does not change outputs: with dropout disabled and deterministic (greedy) decoding, cached and uncached generation must emit the identical token sequence.

Clarifying Questions to Ask

Does the model's forward return logits for the full input length [B, T, V], and is the cross-entropy loss expected to be computed inside or outside forward?
Are the positional embeddings a learned nn.Embedding / nn.Parameter of shape [max_seq_len, d_model], and what is max_seq_len?
Is the attention mask additive (added to scores pre-softmax) or boolean, and what is its expected shape/broadcast convention ([B, H, T, T])?
Does generate decode greedily or with sampling, and is there a fixed seed I should hold constant when comparing the cached vs. uncached paths?
Is the KVCache per-layer (a list of caches) or a single shared object, and what is its exact append/read API?

Part 1: Debug the training code

The training code contains four bugs. Fix them so training loss decreases and generated text becomes plausible. The bug categories are:

Label shift — the LM loss uses the wrong target alignment (input/target slicing).
Positional embedding initialization — the positional embeddings are initialized incorrectly.
Attention mask — the causal mask is wrong.
A typo-related bug — one additional silent bug somewhere in the code (e.g. a misspelled field, a swapped tensor, an off-by-one, or a dropped transform).

After your fixes, the model should train normally, loss should go down, and sampled outputs should look reasonable.
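The label-shift and causal-mask categories above can be illustrated with a minimal sketch. It assumes the loss is computed outside forward and that the mask is boolean; the starter code may do either differently, and the sizes and random stand-in tensors here are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 5, 11                        # hypothetical toy sizes
tokens = torch.randint(0, V, (B, T))      # token ids, shape [B, T]
logits = torch.randn(B, T, V)             # stand-in for model(tokens)

# Label shift: position t predicts token t+1, so drop the last logit
# and the first token before computing cross-entropy.
shift_logits = logits[:, :-1, :]          # [B, T-1, V]
shift_targets = tokens[:, 1:]             # [B, T-1]
loss = F.cross_entropy(shift_logits.reshape(-1, V), shift_targets.reshape(-1))

# Causal mask: True where query i may attend to key j (j <= i).
allowed = torch.tril(torch.ones(T, T)).bool()
scores = torch.randn(B, 1, T, T)          # stand-in scores, [B, H, T, T]
scores = scores.masked_fill(~allowed, float("-inf"))
weights = scores.softmax(dim=-1)          # each row sums to 1 over visible keys
```

If the data loader already yields shifted (input, target) pairs, shifting again inside the loss would be its own off-by-one bug, so it is worth checking which side owns the shift.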

Clarifying Questions for this Part

Are dropout and weight tying (between the input embedding and the LM head) part of the intended design, so I don't "fix" something that is actually correct?
Should the attention scores be scaled by $1/\sqrt{d_{head}}$ in the given code, or is that handled elsewhere?

What This Part Should Cover

Correctly identifies and fixes all four bugs, and explains why each is wrong (not just patches it): the identity-mapping failure from a missing shift, future-information leakage from a bad mask, training instability from bad positional init.
A disciplined debugging method — overfit-a-tiny-batch as the oracle, shape/value printing, bisection — rather than guess-and-check.
A correctness signal after fixing: loss drops on a toy dataset and generated text stops being random noise.
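The overfit-a-tiny-batch check can be sketched as follows. `TinyLM` is a hypothetical placeholder for the model under test (an embedding plus a linear head, not the transformer from the starter code), and the sizes, learning rate, and step count are assumptions chosen so the loop runs in well under a second.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
V, T, D = 16, 8, 32                       # hypothetical toy sizes

class TinyLM(nn.Module):
    """Placeholder for the model under test: embedding + linear head."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.head = nn.Linear(D, V)

    def forward(self, idx):
        return self.head(self.emb(idx))   # [B, T, V]

model = TinyLM()
# Four fixed sequences in which each token is followed by token + 1 (mod V).
batch = torch.stack([(torch.arange(T) + s) % V for s in range(4)])
opt = torch.optim.Adam(model.parameters(), lr=3e-2)

losses = []
for step in range(200):
    logits = model(batch[:, :-1])         # inputs: all but the last token
    loss = F.cross_entropy(logits.reshape(-1, V), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

With the real model swapped in, a loss that plateaus well above zero on a batch this small points at a remaining bug rather than at a capacity limit.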

Part 2: Add and validate a KV cache

A KVCache class for attention keys and values is provided. Extend the model to support cached autoregressive decoding so each new step does not recompute keys/values for the whole prefix. Update the implementation so that:

The attention module reads from and appends to the cache correctly (the new query attends over the cached K/V together with the new token's own K/V, and the new K/V are stored in the cache for later steps).
Causal masking is only applied when there is no cache or the cache is empty — a single new token attending to an all-past cache needs no full causal mask.
Positional encodings use the correct offset based on the current cache length (the new token is at position cache_len, not 0).
The generate function performs incremental decoding with the cache (pre-fill the prompt, then feed one token at a time).

Finally, verify that deterministic generation with and without the KV cache produces the same token sequence.
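A single-head sketch of the cached attention path, checked against full causal attention, might look like the following. `ToyKVCache` is a hypothetical stand-in, since the API of the provided KVCache class is not given here; the weights and sizes are also assumptions.

```python
import math
import torch

torch.manual_seed(0)
T, D = 6, 8                                # hypothetical toy sizes
x = torch.randn(1, T, D)                   # hidden states for one sequence
Wq, Wk, Wv = (torch.randn(D, D) / math.sqrt(D) for _ in range(3))

def attend(q, k, v, causal):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:
        Tq, Tk = scores.shape[-2:]
        # Query i sits at absolute position (Tk - Tq + i); it sees keys up to there.
        allowed = torch.ones(Tq, Tk).tril(diagonal=Tk - Tq).bool()
        scores = scores.masked_fill(~allowed, float("-inf"))
    return scores.softmax(dim=-1) @ v

class ToyKVCache:
    """Hypothetical stand-in for the provided KVCache (real API unknown)."""
    def __init__(self):
        self.k = None
        self.v = None

    def append(self, k, v):
        self.k = k if self.k is None else torch.cat([self.k, k], dim=1)
        self.v = v if self.v is None else torch.cat([self.v, v], dim=1)
        return self.k, self.v

# Reference: full causal attention over the whole sequence.
full = attend(x @ Wq, x @ Wk, x @ Wv, causal=True)

# Cached path: pre-fill 3 tokens with a causal mask, then one token at a time.
cache, outs, prefill = ToyKVCache(), [], 3
k, v = cache.append(x[:, :prefill] @ Wk, x[:, :prefill] @ Wv)
outs.append(attend(x[:, :prefill] @ Wq, k, v, causal=True))
for t in range(prefill, T):
    step = x[:, t:t + 1]
    k, v = cache.append(step @ Wk, step @ Wv)   # new K/V join the cache
    outs.append(attend(step @ Wq, k, v, causal=False))
cached = torch.cat(outs, dim=1)
print(torch.allclose(full, cached, atol=1e-5))
```

The single-token step needs no mask because every key in the cache is at or before the query's position; the pre-fill still needs one because it processes several queries at once.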

Clarifying Questions for this Part

Does the existing forward need to optionally accept and return per-layer cache state, or should I subclass/wrap the model to thread it through?
Should generate pre-fill the prompt in a single forward (then loop one token), or feed the prompt token-by-token to build the cache?

What This Part Should Cover

A correct, layer-wise cache: new K/V appended in time order, query attends against the full prefix, cache returned/persisted across steps.
Right masking discipline: no full causal mask for single-token-with-cache, but a causal mask retained for multi-token pre-fill.
Correct position offset (+ cache_len) and a clear explanation of why omitting it breaks equivalence.
An explicit equivalence test (eval mode + greedy + fixed seed) and an articulated complexity argument: without a cache, each step recomputes attention over the whole prefix, costing $O(T^2)$ per new token and $O(T^3)$ over the run; with a cache, each new token costs $O(T)$ attention work, or $O(T^2)$ over the run.
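The position offset can be shown in isolation with a short sketch. It assumes learned positional embeddings stored in an nn.Embedding; `positions` is a hypothetical helper and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
max_seq_len, d_model = 16, 4                   # hypothetical sizes
pos_emb = nn.Embedding(max_seq_len, d_model)   # learned positional table

def positions(num_new_tokens, cache_len):
    """Absolute positions for the tokens fed in this call."""
    return torch.arange(cache_len, cache_len + num_new_tokens)

full = pos_emb(positions(6, cache_len=0))      # uncached pass: positions 0..5
prefill = pos_emb(positions(5, cache_len=0))   # cached pre-fill: positions 0..4
step = pos_emb(positions(1, cache_len=5))      # next token sits at position 5
wrong = pos_emb(positions(1, cache_len=0))     # bug: new token placed at 0

print(torch.equal(full, torch.cat([prefill, step])))
print(torch.equal(full[5:], wrong))
```

Without the offset, every decoded token receives the embedding for position 0, so its hidden state differs from the uncached pass and greedy decoding can diverge from the first generated token onward.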

What a Strong Answer Covers

These dimensions span both parts:

Treats the KV cache as an inference-only optimization that must not change outputs — never letting it alter training or numerical results.
Uses reproducible validation throughout: overfit-a-batch for Part 1, exact token-sequence equality for Part 2, with dropout off and decoding held deterministic.
Communicates the why behind each fix and each cache decision, not just the code edit.

Follow-up Questions

What is the time and memory complexity of generating $T$ tokens with vs. without the KV cache? Estimate cache memory for $L$ layers, $H$ heads, head dim $D$, sequence length $T$, batch size $B$, and discuss when memory (not compute) becomes the bottleneck for long contexts.
How would you extend the cache to batched decoding with prompts of different lengths (left-padding, ragged caches, per-sequence position offsets), and what breaks if the per-sequence offsets are wrong?
Sketch how this changes for rotary position embeddings (RoPE) — where does the offset get applied, and does the cache store pre- or post-rotation keys?
How do multi-query attention (MQA) or grouped-query attention (GQA) reduce the KV-cache footprint, and what is the quality trade-off?
If equivalence fails by one token deep into a long generation, how would you localize the cause — what would you log, and at which layer/step?
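For the memory question above, a back-of-the-envelope estimate is $2 \cdot L \cdot B \cdot H \cdot T \cdot D$ elements (the factor 2 covers keys and values) times the bytes per element. The sketch below uses a hypothetical configuration, not one from the question, and treats GQA as simply reducing the number of KV heads.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes for keys + values across all layers (factor 2 = K and V)."""
    return (2 * n_layers * batch_size * n_kv_heads
            * seq_len * head_dim * bytes_per_elem)

# Hypothetical config: 32 layers, head dim 128, 4096 tokens, batch 1, fp16.
mha = kv_cache_bytes(32, 32, 128, 4096, 1)   # 32 KV heads (standard MHA)
gqa = kv_cache_bytes(32, 8, 128, 4096, 1)    # 8 KV heads shared across queries

print(mha / 2**30, gqa / 2**30)              # prints 2.0 0.5 (GiB)
```

The estimate grows linearly in both batch size and sequence length, which is why long contexts and large batches tend to hit a memory ceiling before a compute one.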

