PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/Machine Learning/OpenAI

Implement Backprop for a Tiny Network

Last updated: Jun 21, 2026

Quick Overview

This question evaluates understanding of backpropagation, gradient derivation, numerical stability of softmax cross-entropy, and practical implementation skills in both NumPy and PyTorch autograd for a two-layer MLP, including correct batched tensor shapes.

  • hard
  • OpenAI
  • Machine Learning
  • Machine Learning Engineer

Implement Backprop for a Tiny Network

Company: OpenAI

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Onsite

Implement and explain the forward and backward pass of a small two-layer neural network for classification — first **from scratch with NumPy**, then with **PyTorch autograd** — and verify the two agree. The network is a standard MLP: $$X \xrightarrow{\;W_1, b_1\;} Z_1 \xrightarrow{\text{ReLU}} H_1 \xrightarrow{\;W_2, b_2\;} Z_2 \;(\text{logits}) \xrightarrow{\text{softmax + CE}} \text{loss}$$ You start with a batched input `X` of shape `[B, D]`. A first linear layer (`W1`, `b1`) maps it to a hidden representation, a ReLU nonlinearity is applied, and a second linear layer (`W2`, `b2`) produces class logits over `C` classes. The loss is softmax cross-entropy against integer labels `y`. You must derive and implement gradients for **every** parameter (`W1`, `b1`, `W2`, `b2`) and for the input `X`, keeping all tensor shapes correct for batched computation. ### Constraints & Assumptions - Input batch `X`: shape `[B, D]` (e.g. `B = 64` rows, `D = 128` features). - Hidden width `H`, number of classes `C` (e.g. `H = 256`, `C = 10`). - `W1`: `[D, H]`, `b1`: `[H]`; `W2`: `[H, C]`, `b2`: `[C]`. - Labels `y`: integer class indices of shape `[B]` (not one-hot). - Loss is the **mean** softmax cross-entropy over the batch. - NumPy version: no autograd, no deep-learning framework — only `numpy` matmul, broadcasting, and reductions. - You may assume float32/float64 numerics and that the interviewer will spot-check shapes as you go. ### Clarifying Questions to Ask - Is the loss reduced by **mean** or **sum** over the batch? (It changes the gradient scaling by a factor of $B$.) - Are labels provided as integer class indices `[B]` or as one-hot vectors `[B, C]`? - Does the second layer output **raw logits** (so softmax lives inside the loss), or probabilities? - Do you want gradients with respect to the input `X` as well, or only the parameters? - Should I assume biases are present on both layers, and is there any weight decay / regularization in the loss? - What batch size and dimensions should I use — a tiny case for correctness, or realistic sizes for a performance discussion? ### Part 1 — Forward pass and loss (NumPy) Implement the forward pass: the two linear layers, the ReLU, and a **numerically stable** softmax cross-entropy. Cache whatever intermediate values the backward pass will need. State the shape of every tensor (`Z1`, `H1`, `Z2`, the loss). ```hint Numerical stability A naive `softmax` overflows on large logits. Subtract the per-row max before exponentiating: $\text{softmax}(z)_i = \dfrac{e^{z_i - \max z}}{\sum_j e^{z_j - \max z}}$. This is exact, not an approximation. ``` ```hint Bias broadcasting `b1` is `[H]` and `Z1` is `[B, H]`. Broadcasting adds `b1` to every row — confirm the axis is right so you're adding one bias *per output unit*, not per example. ``` ```hint What to cache The backward pass will re-read the pre-activation `Z1` (for the ReLU mask), the activation `H1`, the input `X`, both weight matrices, and the softmax probabilities. Decide *now* what to stash so you never recompute the forward pass. ``` #### What This Part Should Cover - The exact shape of each intermediate (`Z1`, `H1`, `Z2` and the scalar loss), stated aloud. - A numerically stable softmax that subtracts the per-row max before `exp`. - Cross-entropy gathered at the true-class index (`probs[arange(B), y]`), reduced by **mean**. - A clean, minimal cache of exactly the tensors backward needs — no more, no less. ### Part 2 — Manual backward pass (NumPy) Derive and implement the gradients of the mean loss with respect to `W1`, `b1`, `W2`, `b2`, and `X`. Be explicit about where the `1/B` averaging factor enters, and verify each gradient has the same shape as the tensor it is the gradient of. ```hint Where to start Don't differentiate softmax and cross-entropy separately — composing them produces one famously clean closed form for $\partial \mathcal{L}/\partial Z_2$ in terms of the predicted probabilities, the labels, and the batch size. Derive that single expression first, then chain-rule backward from it. ``` ```hint Linear-layer gradient pattern A single linear layer $Z = A W + b$ has the same three backward formulas no matter where it sits — one for $dW$, one for $db$, one for the input $dA$. Derive them once (transposes and the batch-axis sum fall out of matching shapes), then reuse the identical pattern for both layers. ``` ```hint ReLU gradient ReLU passes the upstream gradient through only where it was active. Gate $dH_1$ with a boolean mask computed from the *pre-activation* `Z1` you cached (not `H1`), and decide your convention at exactly $z=0$. ``` #### What This Part Should Cover - Correctly identifies that composing softmax and cross-entropy collapses to a single, compact expression for $\partial \mathcal{L}/\partial Z_2$, with the right shape `[B, C]` and the correct placement of the $1/B$ averaging factor. - Derives the gradients of both linear layers with correct matrix orientations (no wrong transposes) and the correct batch-axis reduction for each bias gradient, obtaining shapes `[D, H]`, `[H]`, `[H, C]`, `[C]` respectively. - Propagates the upstream gradient through the input of the first layer to produce `dX` with the correct shape `[B, D]`. - Shows that the $1/B$ factor flows downstream from `dZ2` automatically and is not re-applied at each layer. - Applies the ReLU backward correctly using the cached **pre-activation**, names the convention at $z = 0$, and confirms the gated gradient has the same shape `[B, H]` as the activation. ### Part 3 — PyTorch autograd equivalent Express the same network with PyTorch tensors and `requires_grad`, run `loss.backward()`, and read off `W1.grad`, `b1.grad`, `W2.grad`, `b2.grad`. Then verify your hand-derived NumPy gradients match autograd's on the same random inputs and weights (e.g. `np.allclose` / `torch.allclose` within tolerance). Explain how the computation graph is built and how gradients flow back through it. ```hint Apples-to-apples comparison A mismatch here is almost always a setup difference, not a wrong derivation. Make sure both implementations see identical weights and `X`, and that the loss reductions agree so the batch-averaging factor lines up — otherwise gradients differ by a constant factor. ``` ```hint Logits in, not probabilities `F.cross_entropy` expects **raw logits** and applies a stable log-softmax internally. Passing softmaxed probabilities in is a silent double-softmax bug. ``` #### What This Part Should Cover - Idiomatic setup: `requires_grad=True` on the leaves you want gradients for (including `X` if checking `dX`). - A clear account of the dynamic ("define-by-run") graph, leaf vs. non-leaf tensors, and reverse-mode accumulation in `backward()`. - Why `zero_grad()` matters (gradients accumulate with `+=` across `backward()` calls). - A verification that compares all five gradients within a tight tolerance on small dimensions. ### What a Strong Answer Covers These dimensions span all three parts; the interviewer is watching for them throughout, not within a single part. - **Correct shapes throughout** — every intermediate and every gradient, stated aloud, with the gradient matching the shape of its parameter. - **Consistency of the `1/B` averaging factor** between the forward loss, every backward gradient, and the PyTorch `reduction='mean'` — the single most common source of a constant-factor mismatch. - **A correct verification strategy** — manual vs. autograd on the *same* seeded inputs and weights, and ideally a finite-difference sanity check on a small network. - **Awareness of pitfalls** that recur across parts: silent broadcasting, a wrong transpose in batched matmul, in-place ops breaking autograd, and confusing logits with probabilities. ### Follow-up Questions - How would you extend this to a **mini-batch over many layers** generically — what would a reusable `Layer` interface with `forward`/`backward` look like (generalizing Parts 1 and 2)? - Why does subtracting the max inside softmax not change the result, and what numerical failure does it prevent? - If `loss.backward()` is called twice without `zero_grad()`, what happens to `W1.grad`, and why? - Where do in-place operations (e.g. `relu_()`, `+=` on a tensor that's needed for backward) break autograd, and how does PyTorch detect this? - How would you add a finite-difference gradient check, and what tolerance and step size would you choose?

Quick Answer: This question evaluates understanding of backpropagation, gradient derivation, numerical stability of softmax cross-entropy, and practical implementation skills in both NumPy and PyTorch autograd for a two-layer MLP, including correct batched tensor shapes.

Related Interview Questions

  • Implement 1NN with NumPy - OpenAI (medium)
  • Compute entropy and implement 1-NN - OpenAI (medium)
  • Defend a Research Direction and Experiment Design - OpenAI (medium)
  • Debug MiniGPT and Backpropagate Matmul - OpenAI (medium)
  • Filter Bad Human Annotations - OpenAI (medium)
OpenAI logo
OpenAI
Apr 3, 2026, 12:00 AM
Machine Learning Engineer
Onsite
Machine Learning
124
0
Loading...

Implement and explain the forward and backward pass of a small two-layer neural network for classification — first from scratch with NumPy, then with PyTorch autograd — and verify the two agree.

The network is a standard MLP:

X→  W1,b1  Z1→ReLUH1→  W2,b2  Z2  (logits)→softmax + CElossX \xrightarrow{\;W_1, b_1\;} Z_1 \xrightarrow{\text{ReLU}} H_1 \xrightarrow{\;W_2, b_2\;} Z_2 \;(\text{logits}) \xrightarrow{\text{softmax + CE}} \text{loss}XW1​,b1​​Z1​ReLU​H1​W2​,b2​​Z2​(logits)softmax + CE​loss

You start with a batched input X of shape [B, D]. A first linear layer (W1, b1) maps it to a hidden representation, a ReLU nonlinearity is applied, and a second linear layer (W2, b2) produces class logits over C classes. The loss is softmax cross-entropy against integer labels y. You must derive and implement gradients for every parameter (W1, b1, W2, b2) and for the input X, keeping all tensor shapes correct for batched computation.

Constraints & Assumptions

  • Input batch X : shape [B, D] (e.g. B = 64 rows, D = 128 features).
  • Hidden width H , number of classes C (e.g. H = 256 , C = 10 ).
  • W1 : [D, H] , b1 : [H] ; W2 : [H, C] , b2 : [C] .
  • Labels y : integer class indices of shape [B] (not one-hot).
  • Loss is the mean softmax cross-entropy over the batch.
  • NumPy version: no autograd, no deep-learning framework — only numpy matmul, broadcasting, and reductions.
  • You may assume float32/float64 numerics and that the interviewer will spot-check shapes as you go.

Clarifying Questions to Ask

  • Is the loss reduced by mean or sum over the batch? (It changes the gradient scaling by a factor of BBB .)
  • Are labels provided as integer class indices [B] or as one-hot vectors [B, C] ?
  • Does the second layer output raw logits (so softmax lives inside the loss), or probabilities?
  • Do you want gradients with respect to the input X as well, or only the parameters?
  • Should I assume biases are present on both layers, and is there any weight decay / regularization in the loss?
  • What batch size and dimensions should I use — a tiny case for correctness, or realistic sizes for a performance discussion?

Part 1 — Forward pass and loss (NumPy)

Implement the forward pass: the two linear layers, the ReLU, and a numerically stable softmax cross-entropy. Cache whatever intermediate values the backward pass will need. State the shape of every tensor (Z1, H1, Z2, the loss).

What This Part Should Cover

  • The exact shape of each intermediate ( Z1 , H1 , Z2 and the scalar loss), stated aloud.
  • A numerically stable softmax that subtracts the per-row max before exp .
  • Cross-entropy gathered at the true-class index ( probs[arange(B), y] ), reduced by mean .
  • A clean, minimal cache of exactly the tensors backward needs — no more, no less.

Part 2 — Manual backward pass (NumPy)

Derive and implement the gradients of the mean loss with respect to W1, b1, W2, b2, and X. Be explicit about where the 1/B averaging factor enters, and verify each gradient has the same shape as the tensor it is the gradient of.

What This Part Should Cover

  • Correctly identifies that composing softmax and cross-entropy collapses to a single, compact expression for ∂L/∂Z2\partial \mathcal{L}/\partial Z_2∂L/∂Z2​ , with the right shape [B, C] and the correct placement of the 1/B1/B1/B averaging factor.
  • Derives the gradients of both linear layers with correct matrix orientations (no wrong transposes) and the correct batch-axis reduction for each bias gradient, obtaining shapes [D, H] , [H] , [H, C] , [C] respectively.
  • Propagates the upstream gradient through the input of the first layer to produce dX with the correct shape [B, D] .
  • Shows that the 1/B1/B1/B factor flows downstream from dZ2 automatically and is not re-applied at each layer.
  • Applies the ReLU backward correctly using the cached pre-activation , names the convention at z=0z = 0z=0 , and confirms the gated gradient has the same shape [B, H] as the activation.

Part 3 — PyTorch autograd equivalent

Express the same network with PyTorch tensors and requires_grad, run loss.backward(), and read off W1.grad, b1.grad, W2.grad, b2.grad. Then verify your hand-derived NumPy gradients match autograd's on the same random inputs and weights (e.g. np.allclose / torch.allclose within tolerance). Explain how the computation graph is built and how gradients flow back through it.

What This Part Should Cover

  • Idiomatic setup: requires_grad=True on the leaves you want gradients for (including X if checking dX ).
  • A clear account of the dynamic ("define-by-run") graph, leaf vs. non-leaf tensors, and reverse-mode accumulation in backward() .
  • Why zero_grad() matters (gradients accumulate with += across backward() calls).
  • A verification that compares all five gradients within a tight tolerance on small dimensions.

What a Strong Answer Covers

These dimensions span all three parts; the interviewer is watching for them throughout, not within a single part.

  • Correct shapes throughout — every intermediate and every gradient, stated aloud, with the gradient matching the shape of its parameter.
  • Consistency of the 1/B averaging factor between the forward loss, every backward gradient, and the PyTorch reduction='mean' — the single most common source of a constant-factor mismatch.
  • A correct verification strategy — manual vs. autograd on the same seeded inputs and weights, and ideally a finite-difference sanity check on a small network.
  • Awareness of pitfalls that recur across parts: silent broadcasting, a wrong transpose in batched matmul, in-place ops breaking autograd, and confusing logits with probabilities.

Follow-up Questions

  • How would you extend this to a mini-batch over many layers generically — what would a reusable Layer interface with forward / backward look like (generalizing Parts 1 and 2)?
  • Why does subtracting the max inside softmax not change the result, and what numerical failure does it prevent?
  • If loss.backward() is called twice without zero_grad() , what happens to W1.grad , and why?
  • Where do in-place operations (e.g. relu_() , += on a tensor that's needed for backward) break autograd, and how does PyTorch detect this?
  • How would you add a finite-difference gradient check, and what tolerance and step size would you choose?

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More Machine Learning•More OpenAI•More Machine Learning Engineer•OpenAI Machine Learning Engineer•OpenAI Machine Learning•Machine Learning Engineer Machine Learning
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.