PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/Machine Learning/OpenAI

Debug MiniGPT and Backpropagate Matmul

Last updated: Jun 21, 2026

Quick Overview

This question evaluates proficiency in debugging PyTorch transformer implementations, reasoning about attention mechanics and causal masking, implementing key–value caching for autoregressive decoding, and constructing provably correct backward passes for matrix multiplications and custom autograd.Functions.

  • medium
  • OpenAI
  • Machine Learning
  • Machine Learning Engineer

Debug MiniGPT and Backpropagate Matmul

Company: OpenAI

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

This is a hands-on PyTorch screen with two independent tasks. You share a code editor with the interviewer and are expected to run the code, read tracebacks, and reason out loud. Both parts are graded on correctness, the cleanliness of your debugging process, and how clearly you explain tensor shapes and gradients. ### Constraints & Assumptions - **Language / framework:** Python with PyTorch. You may use `torch` operations freely but should be able to implement attention and a custom `autograd.Function` from scratch. - **Part A model scale:** a small decoder-only transformer (think nanoGPT-class: a handful of layers, hidden size in the low hundreds, a small vocab). It already runs end-to-end — it does **not** crash — but autoregressive generation produces wrong / garbage text. The bug(s) are logical, not syntactic. - **Part B:** dense 2-D matrices first, $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$; be ready to extend to batched / broadcast inputs. - You have ~1 hour total. Prioritize getting generation correct in Part A and a provably correct backward in Part B over micro-optimizations. ### Clarifying Questions to Ask - Part A: is the model already trained (so I should fix *generation*) or am I expected to also fix the *training* loop and retrain on a toy dataset? - Part A: what's the "expected output" oracle — a known target string, a reference implementation, or a loss threshold I should hit? - Part A: which positional scheme does this model use — learned absolute embeddings or rotary (RoPE)? It changes the KV-cache and masking details. - Part B: do you want a `torch.autograd.Function` with `save_for_backward`, or just the two gradient formulas implemented as functions? Should I support batched (`bmm`) and broadcasted inputs? - Part B: for the follow-up, are you looking for a conceptual explanation of where scan fits, or do you want me to actually code a tiled Hillis–Steele accumulation? --- ### Part A — Debug a mini GPT and add a KV cache You are handed a mini GPT-style language model that trains (or runs a forward pass) but produces incorrect text under autoregressive generation. **Find and fix the bug(s)** until greedy/argmax generation produces the expected output, then **implement key–value caching** so each decode step requires only $O(T)$ attention over the cached prefix instead of a full $O(T^2)$ recompute of the whole sequence. In your walkthrough, explicitly account for: **tensor shapes** at each stage, **causal masking**, the **attention computation**, **positional information**, **loss shifting** (next-token alignment), **train vs. eval mode**, and **sampling/decoding**. ```hint Where to start — make it observable Don't read the whole model top-to-bottom. Pick the smallest signal that localizes the bug: feed a known prefix and `argmax`-decode (no sampling) so output is deterministic; print `logits.shape` and the shape of every intermediate. Most miniGPT bugs are a wrong dimension or a mis-aligned target, and shapes/determinism expose them fast. ``` ```hint Work a checklist, don't guess The prompt already names the dimensions to account for (shapes, masking, the attention computation, positions, loss shifting, train/eval, decoding). Turn that list into a *checklist you verify one item at a time* rather than eyeballing the code. For each item, write the smallest assertion or print that would catch a violation, then run it — the bug announces itself when one check fails. Two checks that catch the most miniGPT bugs cheaply: confirm your decode is deterministic (eval mode, greedy) so behavior is reproducible, and confirm each intermediate tensor's shape matches what you expect before trusting its values. ``` ```hint KV cache — what actually changes at decode Separate **prefill** (run the full prompt once, store per-layer $K,V$ of shape $[\,B, H, T, d_h\,]$) from **decode** (only the *new* token flows through; its query is length-1 and attends over the concatenated cache). Concatenate new $K,V$ along the **time** axis — concatenating along heads or batch is the #1 cache bug. With a length-1 query you no longer need a full square causal mask. ``` #### What This Part Should Cover - **Disciplined debugging:** reproduces deterministically (eval mode, greedy), localizes via shapes/asserts rather than guesswork, and names the specific bug class found (mask, softmax axis, scaling, positions, target shift, or train/eval). - **Attention correctness:** scaling by $\sqrt{d_h}$, causal mask applied *before* softmax, softmax over the key axis, dropout gated by mode, heads split and merged without mixing the head and time axes. - **KV cache:** clean prefill/decode split, time-axis concatenation, correct positional index for the new token, right-sized (non-square) attention at decode, cache threaded through *every* layer, and a stated correctness invariant (cached vs. uncached output must be token-for-token identical). - **Cost awareness:** notes that cache memory grows linearly with context and names at least one reducer (MQA/GQA, quantized, or paged cache). --- ### Part B — Matrix multiply forward + backward, then a parallel-scan follow-up Implement a custom op for $C = A @ B$. 1. **Forward:** compute $C = A B$ for $A \in \mathbb{R}^{m\times k}$, $B \in \mathbb{R}^{k\times n}$. 2. **Backward:** given the upstream gradient $dC = \partial L / \partial C$, **derive and implement** $dA$ and $dB$. Show the index-level derivation, not just the matrix formula. 3. **Follow-up:** explain how a parallel prefix-scan such as **Hillis–Steele** could parallelize the associative accumulation in a *tiled* version of this computation — when it helps and when it doesn't. ```hint Backward — derive it from one entry Start from the scalar definition $C_{ij} = \sum_r A_{ir} B_{rj}$ and apply the chain rule one entry at a time, asking which $C$ entries a single $A_{ir}$ (or $B_{rj}$) actually touches. Once you have $\partial L/\partial A_{ir}$ as a sum, collapse it back into matrix form — notice each gradient pairs $dC$ with exactly one of the two operands. Sanity-check shapes: $dA$ must match $A$, $dB$ must match $B$. ``` ```hint Scan — what is actually associative here The reduction over the contraction index $k$ is a sum, and **addition is associative**, so partial gradients over tiles of $k$ can be combined in any grouping. Frame each tile as a partial product (e.g. `dC_tile @ B_tile.T`) and ask: do you need *only the total* (a plain reduction / tree-sum is better) or *every prefix* (then a Hillis–Steele scan with offsets $1,2,4,\dots$ is the right tool)? Be honest about when scan is the wrong hammer. ``` #### What This Part Should Cover - **Index-level derivation:** both gradients derived entry-by-entry from $C_{ij}=\sum_r A_{ir}B_{rj}$, then collapsed into the correct matrix form (each gradient a product of $dC$ with a transpose of the *other* operand), justified by shape checks ($dA$ matches $A$, $dB$ matches $B$). - **Working op:** a `torch.autograd.Function` with `save_for_backward`, validated with `gradcheck`; correct handling of batched (`bmm`) and broadcast reduction (summing the gradient over broadcast axes). - **Scan judgment:** correctly identifies the associative operation, explains Hillis–Steele's step-offset ($1,2,4,\dots$) structure and its $O(n\log n)$ work / $O(\log n)$ depth, and critically states when a plain reduction or a tuned GEMM kernel beats a scan (work-efficiency) versus when *every prefix* is genuinely required. --- ### What a Strong Answer Covers These dimensions span both parts: - **Communication:** thinks aloud, states assumptions explicitly, and uses pseudocode and concrete tensor shapes to make reasoning legible to the interviewer. - **Verification habit:** for each fix or derivation, names the smallest test that would prove it correct (deterministic decode, shape asserts, cached-vs-uncached parity, `gradcheck`) rather than asserting correctness by inspection. ### Follow-up Questions - In the KV cache, what is the memory footprint as context length grows, and what techniques (e.g. multi-query / grouped-query attention, paged or quantized caches) reduce it? - How would batched generation with *different* sequence lengths interact with a single shared cache and the causal mask? - For Part B, derive the backward when $A$ or $B$ is broadcast (e.g. a single weight matrix multiplied against a batch of inputs) — over which axes must you sum the gradient? - Hillis–Steele is *step-efficient* but not *work-efficient* ($O(n\log n)$ work). When would you prefer a Blelloch (work-efficient) scan instead, and what's the tradeoff?

Quick Answer: This question evaluates proficiency in debugging PyTorch transformer implementations, reasoning about attention mechanics and causal masking, implementing key–value caching for autoregressive decoding, and constructing provably correct backward passes for matrix multiplications and custom autograd.Functions.

Related Interview Questions

  • Implement 1NN with NumPy - OpenAI (medium)
  • Compute entropy and implement 1-NN - OpenAI (medium)
  • Defend a Research Direction and Experiment Design - OpenAI (medium)
  • Implement Backprop for a Tiny Network - OpenAI (hard)
  • Filter Bad Human Annotations - OpenAI (medium)
|Home/Machine Learning/OpenAI

Debug MiniGPT and Backpropagate Matmul

OpenAI logo
OpenAI
Apr 3, 2026, 12:00 AM
mediumMachine Learning EngineerTechnical ScreenMachine Learning
64
0

This is a hands-on PyTorch screen with two independent tasks. You share a code editor with the interviewer and are expected to run the code, read tracebacks, and reason out loud. Both parts are graded on correctness, the cleanliness of your debugging process, and how clearly you explain tensor shapes and gradients.

Constraints & Assumptions

  • Language / framework: Python with PyTorch. You may use torch operations freely but should be able to implement attention and a custom autograd.Function from scratch.
  • Part A model scale: a small decoder-only transformer (think nanoGPT-class: a handful of layers, hidden size in the low hundreds, a small vocab). It already runs end-to-end — it does not crash — but autoregressive generation produces wrong / garbage text. The bug(s) are logical, not syntactic.
  • Part B: dense 2-D matrices first, A∈Rm×kA \in \mathbb{R}^{m \times k}A∈Rm×k , B∈Rk×nB \in \mathbb{R}^{k \times n}B∈Rk×n ; be ready to extend to batched / broadcast inputs.
  • You have ~1 hour total. Prioritize getting generation correct in Part A and a provably correct backward in Part B over micro-optimizations.

Clarifying Questions to Ask

  • Part A: is the model already trained (so I should fix generation ) or am I expected to also fix the training loop and retrain on a toy dataset?
  • Part A: what's the "expected output" oracle — a known target string, a reference implementation, or a loss threshold I should hit?
  • Part A: which positional scheme does this model use — learned absolute embeddings or rotary (RoPE)? It changes the KV-cache and masking details.
  • Part B: do you want a torch.autograd.Function with save_for_backward , or just the two gradient formulas implemented as functions? Should I support batched ( bmm ) and broadcasted inputs?
  • Part B: for the follow-up, are you looking for a conceptual explanation of where scan fits, or do you want me to actually code a tiled Hillis–Steele accumulation?

Part A — Debug a mini GPT and add a KV cache

You are handed a mini GPT-style language model that trains (or runs a forward pass) but produces incorrect text under autoregressive generation. Find and fix the bug(s) until greedy/argmax generation produces the expected output, then implement key–value caching so each decode step requires only O(T)O(T)O(T) attention over the cached prefix instead of a full O(T2)O(T^2)O(T2) recompute of the whole sequence.

In your walkthrough, explicitly account for: tensor shapes at each stage, causal masking, the attention computation, positional information, loss shifting (next-token alignment), train vs. eval mode, and sampling/decoding.

What This Part Should Cover

  • Disciplined debugging: reproduces deterministically (eval mode, greedy), localizes via shapes/asserts rather than guesswork, and names the specific bug class found (mask, softmax axis, scaling, positions, target shift, or train/eval).
  • Attention correctness: scaling by dh\sqrt{d_h}dh​​ , causal mask applied before softmax, softmax over the key axis, dropout gated by mode, heads split and merged without mixing the head and time axes.
  • KV cache: clean prefill/decode split, time-axis concatenation, correct positional index for the new token, right-sized (non-square) attention at decode, cache threaded through every layer, and a stated correctness invariant (cached vs. uncached output must be token-for-token identical).
  • Cost awareness: notes that cache memory grows linearly with context and names at least one reducer (MQA/GQA, quantized, or paged cache).

Part B — Matrix multiply forward + backward, then a parallel-scan follow-up

Implement a custom op for C=A@BC = A @ BC=A@B.

  1. Forward: compute C=ABC = A BC=AB for A∈Rm×kA \in \mathbb{R}^{m\times k}A∈Rm×k , B∈Rk×nB \in \mathbb{R}^{k\times n}B∈Rk×n .
  2. Backward: given the upstream gradient dC=∂L/∂CdC = \partial L / \partial CdC=∂L/∂C , derive and implement dAdAdA and dBdBdB . Show the index-level derivation, not just the matrix formula.
  3. Follow-up: explain how a parallel prefix-scan such as Hillis–Steele could parallelize the associative accumulation in a tiled version of this computation — when it helps and when it doesn't.

What This Part Should Cover

  • Index-level derivation: both gradients derived entry-by-entry from Cij=∑rAirBrjC_{ij}=\sum_r A_{ir}B_{rj}Cij​=∑r​Air​Brj​ , then collapsed into the correct matrix form (each gradient a product of dCdCdC with a transpose of the other operand), justified by shape checks ( dAdAdA matches AAA , dBdBdB matches BBB ).
  • Working op: a torch.autograd.Function with save_for_backward , validated with gradcheck ; correct handling of batched ( bmm ) and broadcast reduction (summing the gradient over broadcast axes).
  • Scan judgment: correctly identifies the associative operation, explains Hillis–Steele's step-offset ( 1,2,4,…1,2,4,\dots1,2,4,… ) structure and its O(nlog⁡n)O(n\log n)O(nlogn) work / O(log⁡n)O(\log n)O(logn) depth, and critically states when a plain reduction or a tuned GEMM kernel beats a scan (work-efficiency) versus when every prefix is genuinely required.

What a Strong Answer Covers

These dimensions span both parts:

  • Communication: thinks aloud, states assumptions explicitly, and uses pseudocode and concrete tensor shapes to make reasoning legible to the interviewer.
  • Verification habit: for each fix or derivation, names the smallest test that would prove it correct (deterministic decode, shape asserts, cached-vs-uncached parity, gradcheck ) rather than asserting correctness by inspection.

Follow-up Questions

  • In the KV cache, what is the memory footprint as context length grows, and what techniques (e.g. multi-query / grouped-query attention, paged or quantized caches) reduce it?
  • How would batched generation with different sequence lengths interact with a single shared cache and the causal mask?
  • For Part B, derive the backward when AAA or BBB is broadcast (e.g. a single weight matrix multiplied against a batch of inputs) — over which axes must you sum the gradient?
  • Hillis–Steele is step-efficient but not work-efficient ( O(nlog⁡n)O(n\log n)O(nlogn) work). When would you prefer a Blelloch (work-efficient) scan instead, and what's the tradeoff?
Loading comments...

Browse More Questions

More Machine Learning•More OpenAI•More Machine Learning Engineer•OpenAI Machine Learning Engineer•OpenAI Machine Learning•Machine Learning Engineer Machine Learning

Write your answer

Your first approved answer each day earns 20 XP.

Sign in to write your answer.
PracHub

Master your tech interviews with 8,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.