How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

What difficulty level is this interview question?

This is a hard difficulty Machine Learning question, commonly asked during Onsite rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at OpenAI during technical interviews.

Implement Backprop for a Tiny Network | OpenAI Interview Question

Q: Implement Backprop for a Tiny Network

This question evaluates understanding of backpropagation, gradient derivation, numerical stability of softmax cross-entropy, and practical implementation skills in both NumPy and PyTorch autograd for a two-layer MLP, including correct batched tensor shapes.

Implement and explain the forward and backward pass of a small two-layer neural network for classification — first from scratch with NumPy, then with PyTorch autograd — and verify the two agree.

The network is a standard MLP:

$X \xrightarrow{\;W_1, b_1\;} Z_1 \xrightarrow{\text{ReLU}} H_1 \xrightarrow{\;W_2, b_2\;} Z_2 \;(\text{logits}) \xrightarrow{\text{softmax + CE}} \text{loss}$

You start with a batched input X of shape [B, D]. A first linear layer (W1, b1) maps it to a hidden representation, a ReLU nonlinearity is applied, and a second linear layer (W2, b2) produces class logits over C classes. The loss is softmax cross-entropy against integer labels y. You must derive and implement gradients for every parameter (W1, b1, W2, b2) and for the input X, keeping all tensor shapes correct for batched computation.

Constraints & Assumptions

Input batch X : shape [B, D] (e.g. B = 64 rows, D = 128 features).
Hidden width H , number of classes C (e.g. H = 256 , C = 10 ).
W1 : [D, H] , b1 : [H] ; W2 : [H, C] , b2 : [C] .
Labels y : integer class indices of shape [B] (not one-hot).
Loss is the mean softmax cross-entropy over the batch.
NumPy version: no autograd, no deep-learning framework — only numpy matmul, broadcasting, and reductions.
You may assume float32/float64 numerics and that the interviewer will spot-check shapes as you go.

Clarifying Questions to Ask

Is the loss reduced by mean or sum over the batch? (It changes the gradient scaling by a factor of $B$ .)
Are labels provided as integer class indices [B] or as one-hot vectors [B, C] ?
Does the second layer output raw logits (so softmax lives inside the loss), or probabilities?
Do you want gradients with respect to the input X as well, or only the parameters?
Should I assume biases are present on both layers, and is there any weight decay / regularization in the loss?
What batch size and dimensions should I use — a tiny case for correctness, or realistic sizes for a performance discussion?

Part 1 — Forward pass and loss (NumPy)

Implement the forward pass: the two linear layers, the ReLU, and a numerically stable softmax cross-entropy. Cache whatever intermediate values the backward pass will need. State the shape of every tensor (Z1, H1, Z2, the loss).

What This Part Should Cover

The exact shape of each intermediate ( Z1 , H1 , Z2 and the scalar loss), stated aloud.
A numerically stable softmax that subtracts the per-row max before exp .
Cross-entropy gathered at the true-class index ( probs[arange(B), y] ), reduced by mean .
A clean, minimal cache of exactly the tensors backward needs — no more, no less.

Part 2 — Manual backward pass (NumPy)

Derive and implement the gradients of the mean loss with respect to W1, b1, W2, b2, and X. Be explicit about where the 1/B averaging factor enters, and verify each gradient has the same shape as the tensor it is the gradient of.

What This Part Should Cover

Correctly identifies that composing softmax and cross-entropy collapses to a single, compact expression for $\partial \mathcal{L}/\partial Z_2$ , with the right shape [B, C] and the correct placement of the $1/B$ averaging factor.
Derives the gradients of both linear layers with correct matrix orientations (no wrong transposes) and the correct batch-axis reduction for each bias gradient, obtaining shapes [D, H] , [H] , [H, C] , [C] respectively.
Propagates the upstream gradient through the input of the first layer to produce dX with the correct shape [B, D] .
Shows that the $1/B$ factor flows downstream from dZ2 automatically and is not re-applied at each layer.
Applies the ReLU backward correctly using the cached pre-activation , names the convention at $z = 0$ , and confirms the gated gradient has the same shape [B, H] as the activation.

Part 3 — PyTorch autograd equivalent

Express the same network with PyTorch tensors and requires_grad, run loss.backward(), and read off W1.grad, b1.grad, W2.grad, b2.grad. Then verify your hand-derived NumPy gradients match autograd's on the same random inputs and weights (e.g. np.allclose / torch.allclose within tolerance). Explain how the computation graph is built and how gradients flow back through it.

What This Part Should Cover

Idiomatic setup: requires_grad=True on the leaves you want gradients for (including X if checking dX ).
A clear account of the dynamic ("define-by-run") graph, leaf vs. non-leaf tensors, and reverse-mode accumulation in backward() .
Why zero_grad() matters (gradients accumulate with += across backward() calls).
A verification that compares all five gradients within a tight tolerance on small dimensions.

What a Strong Answer Covers

These dimensions span all three parts; the interviewer is watching for them throughout, not within a single part.

Correct shapes throughout — every intermediate and every gradient, stated aloud, with the gradient matching the shape of its parameter.
Consistency of the 1/B averaging factor between the forward loss, every backward gradient, and the PyTorch reduction='mean' — the single most common source of a constant-factor mismatch.
A correct verification strategy — manual vs. autograd on the same seeded inputs and weights, and ideally a finite-difference sanity check on a small network.
Awareness of pitfalls that recur across parts: silent broadcasting, a wrong transpose in batched matmul, in-place ops breaking autograd, and confusing logits with probabilities.

Follow-up Questions

How would you extend this to a mini-batch over many layers generically — what would a reusable Layer interface with forward / backward look like (generalizing Parts 1 and 2)?
Why does subtracting the max inside softmax not change the result, and what numerical failure does it prevent?
If loss.backward() is called twice without zero_grad() , what happens to W1.grad , and why?
Where do in-place operations (e.g. relu_() , += on a tensor that's needed for backward) break autograd, and how does PyTorch detect this?
How would you add a finite-difference gradient check, and what tolerance and step size would you choose?

The network is a standard MLP:

$X \xrightarrow{\;W_1, b_1\;} Z_1 \xrightarrow{\text{ReLU}} H_1 \xrightarrow{\;W_2, b_2\;} Z_2 \;(\text{logits}) \xrightarrow{\text{softmax + CE}} \text{loss}$

Constraints & Assumptions

Input batch X : shape [B, D] (e.g. B = 64 rows, D = 128 features).
Hidden width H , number of classes C (e.g. H = 256 , C = 10 ).
W1 : [D, H] , b1 : [H] ; W2 : [H, C] , b2 : [C] .
Labels y : integer class indices of shape [B] (not one-hot).
Loss is the mean softmax cross-entropy over the batch.
NumPy version: no autograd, no deep-learning framework — only numpy matmul, broadcasting, and reductions.
You may assume float32/float64 numerics and that the interviewer will spot-check shapes as you go.

Clarifying Questions to Ask

Is the loss reduced by mean or sum over the batch? (It changes the gradient scaling by a factor of $B$ .)
Are labels provided as integer class indices [B] or as one-hot vectors [B, C] ?
Does the second layer output raw logits (so softmax lives inside the loss), or probabilities?
Do you want gradients with respect to the input X as well, or only the parameters?
Should I assume biases are present on both layers, and is there any weight decay / regularization in the loss?
What batch size and dimensions should I use — a tiny case for correctness, or realistic sizes for a performance discussion?

Part 1 — Forward pass and loss (NumPy)

What This Part Should Cover

The exact shape of each intermediate ( Z1 , H1 , Z2 and the scalar loss), stated aloud.
A numerically stable softmax that subtracts the per-row max before exp .
Cross-entropy gathered at the true-class index ( probs[arange(B), y] ), reduced by mean .
A clean, minimal cache of exactly the tensors backward needs — no more, no less.

Part 2 — Manual backward pass (NumPy)

What This Part Should Cover

Correctly identifies that composing softmax and cross-entropy collapses to a single, compact expression for $\partial \mathcal{L}/\partial Z_2$ , with the right shape [B, C] and the correct placement of the $1/B$ averaging factor.
Derives the gradients of both linear layers with correct matrix orientations (no wrong transposes) and the correct batch-axis reduction for each bias gradient, obtaining shapes [D, H] , [H] , [H, C] , [C] respectively.
Propagates the upstream gradient through the input of the first layer to produce dX with the correct shape [B, D] .
Shows that the $1/B$ factor flows downstream from dZ2 automatically and is not re-applied at each layer.
Applies the ReLU backward correctly using the cached pre-activation , names the convention at $z = 0$ , and confirms the gated gradient has the same shape [B, H] as the activation.

Part 3 — PyTorch autograd equivalent

What This Part Should Cover

Idiomatic setup: requires_grad=True on the leaves you want gradients for (including X if checking dX ).
A clear account of the dynamic ("define-by-run") graph, leaf vs. non-leaf tensors, and reverse-mode accumulation in backward() .
Why zero_grad() matters (gradients accumulate with += across backward() calls).
A verification that compares all five gradients within a tight tolerance on small dimensions.

What a Strong Answer Covers

These dimensions span all three parts; the interviewer is watching for them throughout, not within a single part.

Correct shapes throughout — every intermediate and every gradient, stated aloud, with the gradient matching the shape of its parameter.
Consistency of the 1/B averaging factor between the forward loss, every backward gradient, and the PyTorch reduction='mean' — the single most common source of a constant-factor mismatch.
A correct verification strategy — manual vs. autograd on the same seeded inputs and weights, and ideally a finite-difference sanity check on a small network.
Awareness of pitfalls that recur across parts: silent broadcasting, a wrong transpose in batched matmul, in-place ops breaking autograd, and confusing logits with probabilities.

Follow-up Questions

How would you extend this to a mini-batch over many layers generically — what would a reusable Layer interface with forward / backward look like (generalizing Parts 1 and 2)?
Why does subtracting the max inside softmax not change the result, and what numerical failure does it prevent?
If loss.backward() is called twice without zero_grad() , what happens to W1.grad , and why?
Where do in-place operations (e.g. relu_() , += on a tensor that's needed for backward) break autograd, and how does PyTorch detect this?
How would you add a finite-difference gradient check, and what tolerance and step size would you choose?

Implement Backprop for a Tiny Network

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

Part 1 — Forward pass and loss (NumPy)

What This Part Should Cover

Part 2 — Manual backward pass (NumPy)

What This Part Should Cover

Part 3 — PyTorch autograd equivalent

What This Part Should Cover

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP

Implement Backprop for a Tiny Network

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

Part 1 — Forward pass and loss (NumPy)

What This Part Should Cover

Part 2 — Manual backward pass (NumPy)

What This Part Should Cover

Part 3 — PyTorch autograd equivalent

What This Part Should Cover

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP