What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Technical Screen rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at OpenAI during technical interviews.

Debug MiniGPT and Backpropagate Matmul | OpenAI Interview Question

This question evaluates proficiency in debugging PyTorch transformer implementations, reasoning about attention mechanics and causal masking, implementing key–value caching for autoregressive decoding, and constructing provably correct backward passes for matrix multiplications and custom autograd.Functions.

This is a hands-on PyTorch screen with two independent tasks. You share a code editor with the interviewer and are expected to run the code, read tracebacks, and reason out loud. Both parts are graded on correctness, the cleanliness of your debugging process, and how clearly you explain tensor shapes and gradients.

Constraints & Assumptions

Language / framework: Python with PyTorch. You may use torch operations freely but should be able to implement attention and a custom autograd.Function from scratch.
Part A model scale: a small decoder-only transformer (think nanoGPT-class: a handful of layers, hidden size in the low hundreds, a small vocab). It already runs end-to-end — it does not crash — but autoregressive generation produces wrong / garbage text. The bug(s) are logical, not syntactic.
Part B: dense 2-D matrices first, $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$; be ready to extend to batched / broadcast inputs.
You have ~1 hour total. Prioritize getting generation correct in Part A and a provably correct backward in Part B over micro-optimizations.

Clarifying Questions to Ask

Part A: is the model already trained (so I should only fix generation), or am I expected to also fix the training loop and retrain on a toy dataset?
Part A: what's the "expected output" oracle — a known target string, a reference implementation, or a loss threshold I should hit?
Part A: which positional scheme does this model use — learned absolute embeddings or rotary (RoPE)? It changes the KV-cache and masking details.
Part B: do you want a torch.autograd.Function with save_for_backward, or just the two gradient formulas implemented as functions? Should I support batched (bmm) and broadcast inputs?
Part B: for the follow-up, are you looking for a conceptual explanation of where scan fits, or do you want me to actually code a tiled Hillis–Steele accumulation?

Part A — Debug a mini GPT and add a KV cache

You are handed a mini GPT-style language model that trains (or runs a forward pass) but produces incorrect text under autoregressive generation. Find and fix the bug(s) until greedy/argmax generation produces the expected output, then implement key–value caching so each decode step requires only $O(T)$ attention over the cached prefix instead of a full $O(T^2)$ recompute of the whole sequence.

In your walkthrough, explicitly account for: tensor shapes at each stage, causal masking, the attention computation, positional information, loss shifting (next-token alignment), train vs. eval mode, and sampling/decoding.
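Two of the items above, loss shifting and deterministic decoding, are easy to get wrong by one position. The following is a minimal sketch, not the interview's actual code: `model` is a hypothetical `nn.Module` that maps token ids of shape (B, T) to logits of shape (B, T, V), and the function names are illustrative. It also assumes the prompt plus generated tokens fit in the model's context window, so no cropping is done.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    # logits: (B, T, V) computed from tokens: (B, T).
    # The logit at position t predicts the token at position t + 1,
    # so drop the last logit and the first token before comparing.
    B, T, V = logits.shape
    pred = logits[:, :-1, :].reshape(-1, V)   # (B * (T - 1), V)
    target = tokens[:, 1:].reshape(-1)        # (B * (T - 1),)
    return F.cross_entropy(pred, target)

@torch.no_grad()
def greedy_generate(model, idx, max_new_tokens):
    # model: hypothetical module, (B, T) ids -> (B, T, V) logits.
    model.eval()  # disable dropout so repeated runs give the same output
    for _ in range(max_new_tokens):
        logits = model(idx)                                      # (B, T, V)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # (B, 1)
        idx = torch.cat([idx, next_id], dim=1)                   # (B, T + 1)
    return idx
```

Running greedy decoding in eval mode first gives a reproducible baseline, so any later change in output can be attributed to a fix rather than to sampling noise.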

What This Part Should Cover

Disciplined debugging: reproduces deterministically (eval mode, greedy), localizes via shapes/asserts rather than guesswork, and names the specific bug class found (mask, softmax axis, scaling, positions, target shift, or train/eval).
Attention correctness: scores divided by $\sqrt{d_h}$ (where $d_h$ is the per-head dimension), causal mask applied before softmax, softmax over the key axis, dropout gated by mode, heads split and merged without mixing the head and time axes.
KV cache: clean prefill/decode split, time-axis concatenation, correct positional index for the new token, right-sized (non-square) attention at decode, cache threaded through every layer, and a stated correctness invariant (cached vs. uncached output must be token-for-token identical).
Cost awareness: notes that cache memory grows linearly with context and names at least one reducer (MQA/GQA, quantized, or paged cache).
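The attention and cache points above can be illustrated together. This is a sketch under assumptions rather than the model's real code: it takes already-projected per-head tensors, omits dropout and the output projection, and the names `causal_attention` and `decode_step` are hypothetical. The queries are assumed to be the last `Tq` positions of the `Tk` keys, which covers both prefill (`Tq == Tk`) and single-token decode (`Tq == 1`).

```python
import math
import torch

def causal_attention(q, k, v):
    # q: (B, H, Tq, Dh); k, v: (B, H, Tk, Dh), with Tq <= Tk.
    Tq, Tk = q.size(-2), k.size(-2)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, H, Tq, Tk)
    # Query i sits at absolute position Tk - Tq + i and may only
    # attend to keys at positions <= its own.
    q_pos = torch.arange(Tk - Tq, Tk, device=q.device).unsqueeze(-1)  # (Tq, 1)
    k_pos = torch.arange(Tk, device=q.device).unsqueeze(0)            # (1, Tk)
    scores = scores.masked_fill(k_pos > q_pos, float("-inf"))  # mask before softmax
    return torch.softmax(scores, dim=-1) @ v                   # softmax over keys

def decode_step(q_new, k_new, v_new, cache=None):
    # cache: None (prefill) or a (k, v) tuple of shape (B, H, T_past, Dh).
    if cache is not None:
        k_new = torch.cat([cache[0], k_new], dim=-2)  # concatenate along time
        v_new = torch.cat([cache[1], v_new], dim=-2)
    out = causal_attention(q_new, k_new, v_new)
    return out, (k_new, v_new)
```

The parity invariant is then a short check: run `causal_attention` once over the full sequence, run `decode_step` as a prefill followed by one-token steps, and compare the concatenated outputs with `torch.allclose`. In a full model each layer would hold its own `(k, v)` pair.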

Part B — Matrix multiply forward + backward, then a parallel-scan follow-up

Implement a custom op for $C = A @ B$.

Forward: compute $C = A B$ for $A \in \mathbb{R}^{m\times k}$, $B \in \mathbb{R}^{k\times n}$.
Backward: given the upstream gradient $dC = \partial L / \partial C$, derive and implement $dA$ and $dB$. Show the index-level derivation, not just the matrix formula.
Follow-up: explain how a parallel prefix-scan such as Hillis–Steele could parallelize the associative accumulation in a tiled version of this computation — when it helps and when it doesn't.
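The index-level derivation requested in the Backward item can be sketched as follows, using the notation above with $L$ the scalar loss and $dC_{ij} = \partial L / \partial C_{ij}$. The index letter $r$ for the contracted dimension is an illustrative choice. The key step is that $C_{i'j}$ depends on $A_{ir}$ only when $i' = i$, so the chain-rule sum over all entries of $C$ collapses to a sum over $j$ (and symmetrically over $i$ for $B$).

```latex
\begin{align*}
C_{ij} &= \sum_{r=1}^{k} A_{ir} B_{rj}, \\
dA_{ir} = \frac{\partial L}{\partial A_{ir}}
  &= \sum_{j=1}^{n} \frac{\partial L}{\partial C_{ij}}
     \frac{\partial C_{ij}}{\partial A_{ir}}
   = \sum_{j=1}^{n} dC_{ij}\, B_{rj}
   = \left(dC\, B^{\top}\right)_{ir}, \\
dB_{rj} = \frac{\partial L}{\partial B_{rj}}
  &= \sum_{i=1}^{m} \frac{\partial L}{\partial C_{ij}}
     \frac{\partial C_{ij}}{\partial B_{rj}}
   = \sum_{i=1}^{m} A_{ir}\, dC_{ij}
   = \left(A^{\top} dC\right)_{rj}.
\end{align*}
```

As a shape check, $dC\, B^{\top}$ is $(m \times n)(n \times k) = m \times k$, matching $A$, and $A^{\top} dC$ is $(k \times m)(m \times n) = k \times n$, matching $B$.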

What This Part Should Cover

Index-level derivation: both gradients derived entry-by-entry from $C_{ij}=\sum_r A_{ir}B_{rj}$, then collapsed into the correct matrix form (each gradient a product of $dC$ with a transpose of the other operand), justified by shape checks ($dA$ matches $A$, $dB$ matches $B$).
Working op: a torch.autograd.Function with save_for_backward, validated with gradcheck; correct handling of batched (bmm) inputs and of broadcasting (summing the gradient over the broadcast axes).
Scan judgment: correctly identifies the associative operation, explains Hillis–Steele's step-offset ($1,2,4,\dots$) structure and its $O(n\log n)$ work / $O(\log n)$ depth, and critically states when a plain reduction or a tuned GEMM kernel beats a scan (work-efficiency) versus when every prefix is genuinely required.
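A working op along the lines described above might look like the sketch below. The class name `MatMul` and the helper `_sum_to_shape` are hypothetical, and the sketch assumes both inputs have at least two dimensions. The helper reduces a gradient over any axes that broadcasting added or expanded, so each returned gradient has the shape of its input.

```python
import torch

def _sum_to_shape(grad, shape):
    # Hypothetical helper: undo broadcasting by summing the gradient
    # over leading axes that were added and over axes expanded from size 1.
    while grad.dim() > len(shape):
        grad = grad.sum(dim=0)
    for axis, size in enumerate(shape):
        if size == 1 and grad.size(axis) != 1:
            grad = grad.sum(dim=axis, keepdim=True)
    return grad

class MatMul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)  # both operands are needed in backward
        return a @ b

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        grad_a = grad_b = None
        if ctx.needs_input_grad[0]:
            # dA = dC @ B^T, reduced to A's shape if A was broadcast
            grad_a = _sum_to_shape(grad_out @ b.transpose(-2, -1), a.shape)
        if ctx.needs_input_grad[1]:
            # dB = A^T @ dC, reduced to B's shape if B was broadcast
            grad_b = _sum_to_shape(a.transpose(-2, -1) @ grad_out, b.shape)
        return grad_a, grad_b

# Usage: gradcheck compares the analytic backward against finite differences.
# It needs double precision inputs with requires_grad=True.
x = torch.randn(2, 3, 4, dtype=torch.double, requires_grad=True)  # batched input
w = torch.randn(4, 5, dtype=torch.double, requires_grad=True)     # shared weight
print(torch.autograd.gradcheck(MatMul.apply, (x, w)))
```

In the usage example `w` is broadcast across the batch axis of `x`, so its gradient is the sum of the per-batch `A^T @ dC` terms, which is the reduction the helper performs.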

What a Strong Answer Covers

These dimensions span both parts:

Communication: thinks aloud, states assumptions explicitly, and uses pseudocode and concrete tensor shapes to make reasoning legible to the interviewer.
Verification habit: for each fix or derivation, names the smallest test that would prove it correct (deterministic decode, shape asserts, cached-vs-uncached parity, gradcheck) rather than asserting correctness by inspection.

Follow-up Questions

In the KV cache, what is the memory footprint as context length grows, and what techniques (e.g. multi-query / grouped-query attention, paged or quantized caches) reduce it?
How would batched generation with different sequence lengths interact with a single shared cache and the causal mask?
For Part B, derive the backward when $A$ or $B$ is broadcast (e.g. a single weight matrix multiplied against a batch of inputs) — over which axes must you sum the gradient?
Hillis–Steele is step-efficient but not work-efficient ( $O(n\log n)$ work). When would you prefer a Blelloch (work-efficient) scan instead, and what's the tradeoff?
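For the scan questions, the Hillis–Steele structure can be sketched as below. This is an illustrative version only: the function name is hypothetical, addition is assumed as the associative operation, and each loop iteration stands in for one parallel step (a real GPU kernel would update all elements concurrently with double buffering).

```python
import torch

def hillis_steele_scan(x):
    # Inclusive prefix sum along the last axis using offsets 1, 2, 4, ...
    # Each pass adds the value `offset` positions to the left, so after
    # ceil(log2(n)) passes every position holds the sum of its prefix.
    out = x.clone()
    n = out.size(-1)
    offset = 1
    while offset < n:
        shifted = torch.zeros_like(out)
        shifted[..., offset:] = out[..., :-offset]
        out = out + shifted  # one "parallel step": every element updates at once
        offset *= 2
    return out

# Usage: partial products of A @ B, one per tile of the shared dimension,
# stacked on the last axis. The scan yields every running sum over tiles.
A = torch.randn(4, 8, dtype=torch.double)
B = torch.randn(8, 3, dtype=torch.double)
tiles = torch.stack(
    [A[:, t:t + 2] @ B[t:t + 2, :] for t in range(0, 8, 2)], dim=-1
)                                    # (4, 3, 4): four tile partials
prefixes = hillis_steele_scan(tiles)
print(torch.allclose(prefixes[..., -1], A @ B))  # last prefix is the full product
```

The usage example also shows why a scan is usually the wrong tool for a plain matmul: only the last prefix is needed, and a tree reduction produces it with $O(n)$ work instead of $O(n\log n)$. The scan pays off only when the intermediate prefixes themselves are required.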
