Implement Backprop for a Tiny Network
Company: OpenAI
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Onsite
Implement and explain the forward and backward pass of a small two-layer neural network for classification — first **from scratch with NumPy**, then with **PyTorch autograd** — and verify the two agree.
The network is a standard MLP:
$$X \xrightarrow{\;W_1, b_1\;} Z_1 \xrightarrow{\text{ReLU}} H_1 \xrightarrow{\;W_2, b_2\;} Z_2 \;(\text{logits}) \xrightarrow{\text{softmax + CE}} \text{loss}$$
You start with a batched input `X` of shape `[B, D]`. A first linear layer (`W1`, `b1`) maps it to a hidden representation, a ReLU nonlinearity is applied, and a second linear layer (`W2`, `b2`) produces class logits over `C` classes. The loss is softmax cross-entropy against integer labels `y`. You must derive and implement gradients for **every** parameter (`W1`, `b1`, `W2`, `b2`) and for the input `X`, keeping all tensor shapes correct for batched computation.
### Constraints & Assumptions
- Input batch `X`: shape `[B, D]` (e.g. `B = 64` rows, `D = 128` features).
- Hidden width `H`, number of classes `C` (e.g. `H = 256`, `C = 10`).
- `W1`: `[D, H]`, `b1`: `[H]`; `W2`: `[H, C]`, `b2`: `[C]`.
- Labels `y`: integer class indices of shape `[B]` (not one-hot).
- Loss is the **mean** softmax cross-entropy over the batch.
- NumPy version: no autograd, no deep-learning framework — only `numpy` matmul, broadcasting, and reductions.
- You may assume float32/float64 numerics and that the interviewer will spot-check shapes as you go.
### Clarifying Questions to Ask
- Is the loss reduced by **mean** or **sum** over the batch? (It changes the gradient scaling by a factor of $B$.)
- Are labels provided as integer class indices `[B]` or as one-hot vectors `[B, C]`?
- Does the second layer output **raw logits** (so softmax lives inside the loss), or probabilities?
- Do you want gradients with respect to the input `X` as well, or only the parameters?
- Should I assume biases are present on both layers, and is there any weight decay / regularization in the loss?
- What batch size and dimensions should I use — a tiny case for correctness, or realistic sizes for a performance discussion?
### Part 1 — Forward pass and loss (NumPy)
Implement the forward pass: the two linear layers, the ReLU, and a **numerically stable** softmax cross-entropy. Cache whatever intermediate values the backward pass will need. State the shape of every tensor (`Z1`, `H1`, `Z2`, the loss).
```hint Numerical stability
A naive `softmax` overflows on large logits. Subtract the per-row max before exponentiating: $\text{softmax}(z)_i = \dfrac{e^{z_i - \max z}}{\sum_j e^{z_j - \max z}}$. This is exact, not an approximation.
```
```hint Bias broadcasting
`b1` is `[H]` and `Z1` is `[B, H]`. Broadcasting adds `b1` to every row — confirm the axis is right so you're adding one bias *per output unit*, not per example.
```
```hint What to cache
The backward pass will re-read the pre-activation `Z1` (for the ReLU mask), the activation `H1`, the input `X`, both weight matrices, and the softmax probabilities. Decide *now* what to stash so you never recompute the forward pass.
```
#### What This Part Should Cover
- The exact shape of each intermediate (`Z1`, `H1`, `Z2` and the scalar loss), stated aloud.
- A numerically stable softmax that subtracts the per-row max before `exp`.
- Cross-entropy gathered at the true-class index (`probs[arange(B), y]`), reduced by **mean**.
- A clean, minimal cache of exactly the tensors backward needs — no more, no less.
### Part 2 — Manual backward pass (NumPy)
Derive and implement the gradients of the mean loss with respect to `W1`, `b1`, `W2`, `b2`, and `X`. Be explicit about where the `1/B` averaging factor enters, and verify each gradient has the same shape as the tensor it is the gradient of.
```hint Where to start
Don't differentiate softmax and cross-entropy separately — composing them produces one famously clean closed form for $\partial \mathcal{L}/\partial Z_2$ in terms of the predicted probabilities, the labels, and the batch size. Derive that single expression first, then chain-rule backward from it.
```
```hint Linear-layer gradient pattern
A single linear layer $Z = A W + b$ has the same three backward formulas no matter where it sits — one for $dW$, one for $db$, one for the input $dA$. Derive them once (transposes and the batch-axis sum fall out of matching shapes), then reuse the identical pattern for both layers.
```
```hint ReLU gradient
ReLU passes the upstream gradient through only where it was active. Gate $dH_1$ with a boolean mask computed from the *pre-activation* `Z1` you cached (not `H1`), and decide your convention at exactly $z=0$.
```
#### What This Part Should Cover
- Correctly identifies that composing softmax and cross-entropy collapses to a single, compact expression for $\partial \mathcal{L}/\partial Z_2$, with the right shape `[B, C]` and the correct placement of the $1/B$ averaging factor.
- Derives the gradients of both linear layers with correct matrix orientations (no wrong transposes) and the correct batch-axis reduction for each bias gradient, obtaining shapes `[D, H]`, `[H]`, `[H, C]`, `[C]` respectively.
- Propagates the upstream gradient through the input of the first layer to produce `dX` with the correct shape `[B, D]`.
- Shows that the $1/B$ factor flows downstream from `dZ2` automatically and is not re-applied at each layer.
- Applies the ReLU backward correctly using the cached **pre-activation**, names the convention at $z = 0$, and confirms the gated gradient has the same shape `[B, H]` as the activation.
### Part 3 — PyTorch autograd equivalent
Express the same network with PyTorch tensors and `requires_grad`, run `loss.backward()`, and read off `W1.grad`, `b1.grad`, `W2.grad`, `b2.grad`. Then verify your hand-derived NumPy gradients match autograd's on the same random inputs and weights (e.g. `np.allclose` / `torch.allclose` within tolerance). Explain how the computation graph is built and how gradients flow back through it.
```hint Apples-to-apples comparison
A mismatch here is almost always a setup difference, not a wrong derivation. Make sure both implementations see identical weights and `X`, and that the loss reductions agree so the batch-averaging factor lines up — otherwise gradients differ by a constant factor.
```
```hint Logits in, not probabilities
`F.cross_entropy` expects **raw logits** and applies a stable log-softmax internally. Passing softmaxed probabilities in is a silent double-softmax bug.
```
#### What This Part Should Cover
- Idiomatic setup: `requires_grad=True` on the leaves you want gradients for (including `X` if checking `dX`).
- A clear account of the dynamic ("define-by-run") graph, leaf vs. non-leaf tensors, and reverse-mode accumulation in `backward()`.
- Why `zero_grad()` matters (gradients accumulate with `+=` across `backward()` calls).
- A verification that compares all five gradients within a tight tolerance on small dimensions.
### What a Strong Answer Covers
These dimensions span all three parts; the interviewer is watching for them throughout, not within a single part.
- **Correct shapes throughout** — every intermediate and every gradient, stated aloud, with the gradient matching the shape of its parameter.
- **Consistency of the `1/B` averaging factor** between the forward loss, every backward gradient, and the PyTorch `reduction='mean'` — the single most common source of a constant-factor mismatch.
- **A correct verification strategy** — manual vs. autograd on the *same* seeded inputs and weights, and ideally a finite-difference sanity check on a small network.
- **Awareness of pitfalls** that recur across parts: silent broadcasting, a wrong transpose in batched matmul, in-place ops breaking autograd, and confusing logits with probabilities.
### Follow-up Questions
- How would you extend this to a **mini-batch over many layers** generically — what would a reusable `Layer` interface with `forward`/`backward` look like (generalizing Parts 1 and 2)?
- Why does subtracting the max inside softmax not change the result, and what numerical failure does it prevent?
- If `loss.backward()` is called twice without `zero_grad()`, what happens to `W1.grad`, and why?
- Where do in-place operations (e.g. `relu_()`, `+=` on a tensor that's needed for backward) break autograd, and how does PyTorch detect this?
- How would you add a finite-difference gradient check, and what tolerance and step size would you choose?
Quick Answer: This question evaluates understanding of backpropagation, gradient derivation, numerical stability of softmax cross-entropy, and practical implementation skills in both NumPy and PyTorch autograd for a two-layer MLP, including correct batched tensor shapes.