This question evaluates understanding of backpropagation and the chain rule, proficiency in vectorized NumPy implementation, numerical stability of softmax/cross-entropy, and competency in gradient checking and debugging of model gradients.
Derive backpropagation for a two-layer neural network (Affine → ReLU → Affine → softmax). Write NumPy (Python 3.
7) code to compute the forward pass, loss, and analytical gradients without automatic differentiation using vectorized linear algebra. Implement gradient checking via finite differences and demonstrate how you would debug and fix a mismatch between analytical and numerical gradients. Clearly state tensor shapes at each layer, show the chain rule application, and explain any linear algebra identities used (e.g., broadcasting rules, transpose of products).
Quick Answer: This question evaluates understanding of backpropagation and the chain rule, proficiency in vectorized NumPy implementation, numerical stability of softmax/cross-entropy, and competency in gradient checking and debugging of model gradients.
Assume a mini-batch of N examples, input dimension D, hidden size H, and C classes.
Tensor
Shape
Meaning
X
N × D
mini-batch of inputs
y
(N,)
integer class labels in {0, 1, …, C−1}
W1, b1
D × H, (H,)
first affine layer
W2, b2
H × C, (C,)
second affine layer
This is a coding-and-debugging interview: it is as much about converting conceptual understanding into correct code and debugging skills as it is about the math. Across the parts below you will (1) derive the backprop equations, (2) implement the forward/loss/backward in NumPy, (3) build a gradient checker, and (4) walk through diagnosing a deliberate mismatch.
Constraints & Assumptions
Language/libraries:
Python 3.7+, NumPy only. No automatic differentiation (no PyTorch/TensorFlow/JAX/
autograd
).
Vectorization:
No explicit Python loops over the batch in the forward/backward path. (Loops
are
permitted inside the gradient checker.)
Loss:
Mean (not sum) cross-entropy over the mini-batch, i.e. averaged by
1/N
.
Numerical stability:
Softmax must subtract the row-wise max from the logits before exponentiating.
Scale of the toy problem:
Small enough to grad-check by brute force (e.g.
N≤10
, with
D,H,C
in the single/low-double digits) so finite differences run quickly.
Precision:
float64 throughout; finite-difference step
ε≈10−5
, central differences.
Clarifying Questions to Ask
Is the loss the
mean
over the batch or the
sum
? (This fixes whether a
1/N
factor appears in the gradients.)
Should I include an
L2 regularization
term on the weights, or just the data loss?
Do you want the gradient w.r.t. the
input X
as well, or only the parameter gradients (
W1, b1, W2, b2
)?
What is the
ReLU convention at exactly z=0
— subgradient 0 or 1? (Does it matter for grad-checking?)
What
tolerance
counts as "passing" the gradient check, and on what dataset size should I demonstrate it?
Can I assume labels
y
are
integer class indices
(not one-hot)?
Part 1 — Derive the backpropagation equations
Derive ∂L/∂W1, ∂L/∂b1, ∂L/∂W2, ∂L/∂b2.
Apply the chain rule explicitly, layer by layer.
State the
tensor shape at each step
.
Explain the linear-algebra identities you use — e.g.
(AB)^T = B^T A^T
, the broadcasting rule for the biases, and how the affine gradient
∂L/∂W = X^T · upstream
arises.
What This Part Should Cover
Chain-rule fluency
— clean layer-by-layer decomposition; the fused softmax+CE result derived (not just quoted), with the
1/N
factor placed correctly.
Shape discipline
— a stated shape at every intermediate, and
grad.shape == param.shape
treated as the invariant that pins down each transpose.
Identity justification
—
(AB)^T = B^T A^T
, the bias-as-sum-over-rows argument, and why
∂L/∂W=XTG
rather than
GXT
.
Part 2 — Implement the NumPy code
Implement the forward pass, the softmax cross-entropy loss, and the analytical gradients in NumPy (Python 3.7+).
Use vectorized linear algebra; no automatic differentiation and no explicit Python loops over the batch.
Use a
numerically stable softmax
(subtract the row-wise max before exponentiating).
Include in-code
comments indicating shapes and broadcasting
at each layer and each gradient.
What This Part Should Cover
Vectorized, autodiff-free code
that runs as written with only NumPy, with no batch loops on the forward/backward path.
Numerical stability
— the row-max-subtracting softmax, and an argument for why it is exact.
The fused backward
—
dScores
formed in one shot without materializing a per-example Jacobian, with the single
1/N
applied once.
Self-documenting shapes
— inline comments asserting the shape and the broadcast at every layer and gradient.
Part 3 — Implement gradient checking
Verify your analytical gradients against central finite differences on a small synthetic (toy) dataset, and report the relative error.
What This Part Should Cover
A correct two-sided estimator
with strict one-entry-at-a-time perturbation and guaranteed parameter restoration.
A sound relative-error metric
(
∣a−b∣/(∣a∣+∣b∣)
, not floored at 1.0) plus a stated pass/fail threshold.
A demonstrated run
on a toy dataset with a defensible
ε
and a loss sanity-check against
logC
.
Part 4 — Demonstrate debugging
Show how you would diagnose and fix a mismatch between the analytical and numerical gradients (for example, a missing 1/N averaging factor or an incorrect ReLU mask). Explain the symptom, how you localize it, and the fix.
What This Part Should Cover
Symptom → localization → fix
for at least two distinct bugs (e.g. the
1/N
factor and the ReLU mask), each with its tell-tale signature.
Reading the signature
— uniform-factor-across-all-params vs. layer-localized vs. shape-crash vs.
nan
, and what each implies.
Active isolation
— toggling
N
, forcing dead ReLU units, and shape assertions used as deliberate probes, not guesses.
What a Strong Answer Covers
These signals span all four parts:
Shape discipline as an invariant
carried from the derivation into the code into the debugging.
Correct, single placement of the 1/N factor
, and the ability to detect a violation by its symptom.
Communication
— narrating conventions and trade-offs out loud (e.g. masking the pre-activation vs. the activation, why central differences over one-sided).
Follow-up Questions
How does the cost of finite-difference gradient checking scale with the number of parameters, and why is it used only as a
test
rather than for training?
ReLU is non-differentiable at
z=0
. Why does the gradient check still pass in practice, and when could a unit sitting exactly on the kink actually cause a mismatch?
If you added
L2 regularization
to the loss, what exactly changes in
dW1
and
dW2
, and what is the classic self-inflicted grad-check failure that results from forgetting it?
How would this derivation and code generalize to an
L-layer
network, and which quantity that you computed here (but didn't strictly need) becomes essential?