Two-Layer Neural Network: Backpropagation and Gradient Check (NumPy)
Context
You are implementing a fully connected two-layer neural network for multi-class classification with the architecture:
Affine (XW1 + b1) → ReLU → Affine (·W2 + b2) → softmax → cross-entropy loss.
Assume a mini-batch of N examples, input dimension D, hidden size H, and C classes, with the shapes listed below; a shape-annotated forward-pass sketch follows the list.
- X ∈ R^{N×D}, y ∈ {0, 1, ..., C−1}^N (one integer label per example)
- W1 ∈ R^{D×H}, b1 ∈ R^{H}
- W2 ∈ R^{H×C}, b2 ∈ R^{C}
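A minimal shape-annotated sketch of this forward pass under the conventions above (the function and variable names `forward`, `a1`, `h`, `scores` are illustrative, not prescribed):

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Forward pass for Affine -> ReLU -> Affine; shapes follow the Context section."""
    a1 = X @ W1 + b1         # (N, D) @ (D, H) -> (N, H); b1 (H,) broadcasts across rows
    h = np.maximum(0.0, a1)  # ReLU, (N, H)
    scores = h @ W2 + b2     # (N, H) @ (H, C) -> (N, C); b2 (C,) broadcasts across rows
    return a1, h, scores     # scores are the unnormalized class logits
```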
Task
- Derive the backpropagation equations for all parameters (W1, b1, W2, b2), clearly applying the chain rule and stating the tensor shapes at each step. Explain any linear algebra identities used, e.g., (AB)^T = B^T A^T, NumPy broadcasting rules, and the affine-layer rule ∂L/∂W = X^T · (upstream gradient) for a layer computing XW.
- Implement NumPy (Python 3.7+) code to compute the forward pass, softmax cross-entropy loss, and analytical gradients. Use vectorized linear algebra; do not use automatic differentiation or explicit Python loops over the batch. (A minimal sketch follows this list.)
- Implement gradient checking via central finite differences on a small synthetic dataset. Report the relative error between analytical and numerical gradients. (A checker sketch appears after the Requirements list.)
- Demonstrate how you would debug and fix a mismatch between analytical and numerical gradients (e.g., a missing 1/N factor or an incorrect ReLU mask). Explain the symptoms and the fix. (A worked example appears after the Deliverables list.)
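For reference, the standard chain-rule results for this architecture are ∂L/∂scores = (P − Y)/N (P the softmax probabilities, Y the one-hot labels), ∂L/∂W2 = h^T (∂L/∂scores), ∂L/∂b2 = the column sums of ∂L/∂scores, ∂L/∂h = (∂L/∂scores) W2^T, followed by the ReLU mask (a1 > 0) and the same affine pattern for W1 and b1. A minimal vectorized sketch along these lines, assuming the `forward` conventions above and a loss averaged over the batch (again, the name `loss_and_grads` is illustrative):

```python
import numpy as np

def loss_and_grads(X, y, W1, b1, W2, b2):
    N = X.shape[0]

    # Forward pass (shapes as in the Context section).
    a1 = X @ W1 + b1                      # (N, H)
    h = np.maximum(0.0, a1)               # (N, H)
    scores = h @ W2 + b2                  # (N, C)

    # Numerically stable softmax + cross-entropy, averaged over the batch.
    shifted = scores - scores.max(axis=1, keepdims=True)              # (N, C)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)             # (N, C)
    loss = -log_probs[np.arange(N), y].mean()

    # Backward pass.
    dscores = probs.copy()                # dL/dscores = (P - Y)/N, shape (N, C)
    dscores[np.arange(N), y] -= 1.0
    dscores /= N                          # the 1/N from averaging the loss

    dW2 = h.T @ dscores                   # (H, N) @ (N, C) -> (H, C)
    db2 = dscores.sum(axis=0)             # (C,); sums out the forward broadcast of b2
    dh = dscores @ W2.T                   # (N, C) @ (C, H) -> (N, H)
    da1 = dh * (a1 > 0)                   # ReLU mask, (N, H)
    dW1 = X.T @ da1                       # (D, N) @ (N, H) -> (D, H)
    db1 = da1.sum(axis=0)                 # (H,)

    return loss, {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}
```

Note how each bias gradient sums over the batch axis, which is exactly the transpose of broadcasting the bias across rows in the forward pass, and how dividing dscores by N once propagates the 1/N from the averaged loss into every downstream gradient.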
Requirements
- Use a numerically stable softmax (subtract the row-wise max from the logits before exponentiating), as in the sketch above.
- Show shapes at each layer and for each gradient.
- Keep code self-contained and runnable with only NumPy. Loops are allowed in gradient checking (see the checker sketch after this list).
- Include comments in code indicating shapes and broadcasting.
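A plausible central-difference checker is sketched below. It assumes `f()` recomputes the scalar loss from the current, in-place perturbed parameter array; the step size, the number of sampled entries, and the relative-error formula are common choices, not requirements:

```python
import numpy as np

def grad_check(f, param, analytic_grad, h=1e-5, num_checks=10, seed=0):
    """Compare analytic_grad against central finite differences of f at random entries.

    f() must recompute the scalar loss using the current contents of `param`
    (which is modified in place and restored). Returns the max relative error seen.
    """
    rng = np.random.default_rng(seed)
    max_rel_err = 0.0
    for _ in range(num_checks):
        idx = tuple(rng.integers(0, s) for s in param.shape)  # random entry of param
        old = param[idx]
        param[idx] = old + h
        loss_plus = f()
        param[idx] = old - h
        loss_minus = f()
        param[idx] = old                                      # restore original value
        num_grad = (loss_plus - loss_minus) / (2.0 * h)       # central difference
        ana_grad = analytic_grad[idx]
        rel_err = abs(num_grad - ana_grad) / max(1e-12, abs(num_grad) + abs(ana_grad))
        max_rel_err = max(max_rel_err, rel_err)
    return max_rel_err
```

Sampling a handful of entries per parameter keeps the cost at a few loss evaluations instead of one per element; with float64 and h ≈ 1e-5, correct gradients typically give relative errors below roughly 1e-7.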
Deliverables
- Derivation with shapes and chain rule steps.
- NumPy code for forward pass, loss, and backprop gradients.
- Gradient checking routine and example run on a toy dataset.
- Debugging narrative for a deliberately introduced mismatch and its fix.
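To tie the last two deliverables together, a toy run might look like the sketch below; it assumes the hypothetical `loss_and_grads` and `grad_check` helpers from the earlier sketches are in scope, and the sizes and seed are arbitrary. On the debugging side: if the 1/N division is omitted from the backward pass while the loss is still averaged, every analytical gradient entry comes out N times too large, so the reported relative error is large but nearly identical across entries and parameters; an incorrect ReLU mask, by contrast, typically leaves dW2 and db2 matching while dW1 and db1 fail, pointing at the hidden-layer backward step.

```python
import numpy as np

# Toy problem: arbitrary small sizes so the check runs instantly.
rng = np.random.default_rng(0)
N, D, H, C = 5, 4, 10, 3
X = rng.standard_normal((N, D))             # (N, D)
y = rng.integers(0, C, size=N)              # (N,) integer labels in {0, ..., C-1}
params = {
    "W1": 0.1 * rng.standard_normal((D, H)), "b1": np.zeros(H),
    "W2": 0.1 * rng.standard_normal((H, C)), "b2": np.zeros(C),
}

loss, grads = loss_and_grads(X, y, **params)
print(f"loss on toy batch: {loss:.4f}")
for name, p in params.items():
    # The lambda recomputes the loss with whatever values grad_check writes into p in place.
    rel_err = grad_check(lambda: loss_and_grads(X, y, **params)[0], p, grads[name])
    print(f"{name}: max relative error {rel_err:.2e}")  # correct gradients: typically < 1e-7
```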