From first principles, implement a tiny neural network for binary classification with one hidden layer. Write the forward pass, a cross-entropy loss, and the complete backward pass (no autograd). Derive analytic gradients for all parameters, implement gradient descent, and verify correctness with finite-difference gradient checking. Discuss numerical stability (e.g., log-sum-exp), initialization, activation choices, and how batch size affects variance.
Quick Answer: This question evaluates understanding and practical implementation of neural network fundamentals, specifically analytic backpropagation and gradient derivation, numerically stable binary cross-entropy computation, parameter initialization, gradient descent updates, and gradient checking.
Tiny Neural Network From First Principles: Binary Classification
Implement and analyze a minimal neural network for binary classification with a single hidden layer, using vectorized NumPy (or a similar array library) without autograd — every gradient must be derived and coded by hand.
Assume a dataset with features $X \in \mathbb{R}^{N \times D}$ and labels $y \in \{0,1\}^N$. The network is:
Hidden layer: $H$ units with an activation $f$ (ReLU or tanh).
Output layer: a single unit with a sigmoid producing $P(y = 1 \mid x)$.
The deliverable is a complete, self-contained training pipeline (forward, loss, backward, optimization, gradient check) plus a short written discussion of the numerical and design choices. The question is split into six parts below.
Computation is vectorized over the batch: no Python loops over the $N$ examples in the forward/backward path.
No autograd / no deep-learning framework gradients: analytic derivatives only. Finite differences are allowed solely for the gradient-check verification step (Part 5).
Work in float64 for the implementation and especially for gradient checking.
The loss is the mean (not sum) binary cross-entropy over the batch.
Clarifying Questions to Ask
Should the loss be the mean or sum over the batch? (This decision changes the gradient scale and how the learning rate is interpreted.)
Which hidden activation is in scope: ReLU, tanh, or both? (It changes the derivative term and the recommended initialization.)
Is mini-batch SGD expected, or is full-batch / single-batch gradient descent sufficient for the deliverable?
What floating-point precision should the reference target, and how tight should the gradient-check tolerance be?
Should the numerically stable loss be expressed in terms of the probability $p$ or the logit $z_2$? (This is the crux of the stability design.)
Is a runnable end-to-end demo on a toy dataset expected, or just the component functions?
Part 1 — Forward pass
Compute, in vectorized form:
$$z_1 = X W_1 + b_1, \quad a_1 = f(z_1), \quad z_2 = a_1 W_2 + b_2, \quad p = \sigma(z_2),$$
where $\sigma$ is the logistic sigmoid. State the shape of each intermediate and confirm the biases broadcast correctly across the $N$ rows.
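A minimal sketch of the forward pass described above, assuming NumPy, parameter shapes `W1: (D, H)`, `b1: (H,)`, `W2: (H, 1)`, `b2: (1,)`, and hypothetical helper names `sigmoid` and `forward` that are not part of the question. The shape comments show how the 1-D biases broadcast across the $N$ rows.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid evaluated piecewise so exp() never overflows."""
    out = np.empty_like(z, dtype=np.float64)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # exp of a non-positive number
    ez = np.exp(z[~pos])                       # exp of a negative number
    out[~pos] = ez / (1.0 + ez)
    return out

def forward(X, W1, b1, W2, b2, activation="tanh"):
    """Vectorized forward pass; returns every intermediate for later reuse."""
    z1 = X @ W1 + b1                 # (N, D) @ (D, H) + (H,) -> (N, H)
    if activation == "relu":
        a1 = np.maximum(z1, 0.0)     # (N, H)
    else:
        a1 = np.tanh(z1)             # (N, H)
    z2 = a1 @ W2 + b2                # (N, H) @ (H, 1) + (1,) -> (N, 1)
    p = sigmoid(z2)                  # (N, 1), values in [0, 1]
    return z1, a1, z2, p

# Illustrative sizes only.
rng = np.random.default_rng(0)
N, D, H = 5, 3, 4
X = rng.normal(size=(N, D))
W1 = rng.normal(size=(D, H)) * 0.1
b1 = np.zeros(H)
W2 = rng.normal(size=(H, 1)) * 0.1
b2 = np.zeros(1)

z1, a1, z2, p = forward(X, W1, b1, W2, b2)
print(z1.shape, a1.shape, z2.shape, p.shape)   # (5, 4) (5, 4) (5, 1) (5, 1)
```

Keeping `z2` as an `(N, 1)` column rather than a flat vector is one possible convention; it means the labels should be reshaped to `(N, 1)` before they are combined with `p` or `z2`.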
Part 2 — Loss (numerically stable binary cross-entropy)
Implement mean binary cross-entropy. The naive form $-[y \log p + (1-y)\log(1-p)]$ blows up once $p$ rounds to exactly 0 or 1. Use a stable formulation (e.g. softplus $\log(1+e^{x})$ or log-sum-exp) so the loss never produces $\pm\infty$ or NaN.
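One way to get a stable loss is to work from the logit $z_2$ instead of the probability. Substituting $p = \sigma(z_2)$ gives $-\log p = \mathrm{softplus}(-z_2)$ and $-\log(1-p) = \mathrm{softplus}(z_2)$, and since $\mathrm{softplus}(-z) = \mathrm{softplus}(z) - z$ the per-example loss collapses to $\mathrm{softplus}(z_2) - y\,z_2$. The sketch below assumes NumPy and a hypothetical function name `bce_with_logits`; `np.logaddexp(0, z)` computes $\log(e^0 + e^z)$, which is softplus evaluated without overflow.

```python
import numpy as np

def bce_with_logits(z2, y):
    """Mean binary cross-entropy computed from logits: softplus(z) - y * z."""
    z = np.asarray(z2, dtype=np.float64).reshape(-1)
    y = np.asarray(y, dtype=np.float64).reshape(-1)
    return np.mean(np.logaddexp(0.0, z) - y * z)

# Extreme logits on the wrong side of the label: the naive form would
# evaluate log(0) here, while the logit form stays finite.
z = np.array([-800.0, -2.0, 0.0, 3.0, 800.0])
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(bce_with_logits(z, y))
```

For a confidently wrong example the loss grows linearly in $|z_2|$ rather than jumping to infinity, which also keeps the gradient $p - y$ well defined.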
Part 3 — Backward pass (analytic gradients, no autograd)
Derive and implement $\partial L/\partial W_1$, $\partial L/\partial b_1$, $\partial L/\partial W_2$, $\partial L/\partial b_2$ by the chain rule. State the derivative of your hidden activation explicitly.
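A sketch of one possible derivation and implementation, assuming the mean loss $L = \frac{1}{N}\sum_i [\mathrm{softplus}(z_{2,i}) - y_i z_{2,i}]$ and the shapes `W1: (D, H)`, `b1: (H,)`, `W2: (H, 1)`, `b2: (1,)`. Writing $\delta_2 = \partial L/\partial z_2 = (p - y)/N$, the chain rule gives $\partial L/\partial W_2 = a_1^\top \delta_2$, $\partial L/\partial b_2 = \sum_i \delta_{2,i}$, $\delta_1 = (\delta_2 W_2^\top) \odot f'(z_1)$, $\partial L/\partial W_1 = X^\top \delta_1$, and $\partial L/\partial b_1 = \sum_i \delta_{1,i}$. The activation derivatives are $f'(z) = 1 - \tanh^2(z)$ for tanh and $f'(z) = \mathbb{1}[z > 0]$ for ReLU (taking 0 at $z = 0$ is a convention, not the only choice). The function names `forward` and `backward` are hypothetical.

```python
import numpy as np

def forward(X, params, activation="tanh"):
    W1, b1, W2, b2 = params
    z1 = X @ W1 + b1                                   # (N, H)
    a1 = np.tanh(z1) if activation == "tanh" else np.maximum(z1, 0.0)
    z2 = a1 @ W2 + b2                                  # (N, 1)
    return z1, a1, z2

def backward(X, y, params, activation="tanh"):
    """Analytic gradients of the mean BCE with respect to W1, b1, W2, b2."""
    W1, b1, W2, b2 = params
    N = X.shape[0]
    z1, a1, z2 = forward(X, params, activation)
    # sigmoid(z) = 0.5 * (1 + tanh(z / 2)); tanh saturates instead of overflowing.
    p = 0.5 * (1.0 + np.tanh(0.5 * z2))                # (N, 1)
    dz2 = (p - y.reshape(-1, 1)) / N                   # (N, 1); 1/N comes from the mean
    dW2 = a1.T @ dz2                                   # (H, 1)
    db2 = dz2.sum(axis=0)                              # (1,)
    da1 = dz2 @ W2.T                                   # (N, H)
    if activation == "tanh":
        dz1 = da1 * (1.0 - a1 ** 2)                    # f'(z1) = 1 - tanh(z1)^2
    else:
        dz1 = da1 * (z1 > 0.0)                         # f'(z1) = 1 where z1 > 0, else 0
    dW1 = X.T @ dz1                                    # (D, H)
    db1 = dz1.sum(axis=0)                              # (H,)
    return dW1, db1, dW2, db2

# Illustrative sizes only.
rng = np.random.default_rng(0)
N, D, H = 6, 3, 4
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=N).astype(np.float64)
params = [rng.normal(size=(D, H)) * 0.5, np.zeros(H),
          rng.normal(size=(H, 1)) * 0.5, np.zeros(1)]

grads = backward(X, y, params)
for name, theta, g in zip(("W1", "b1", "W2", "b2"), params, grads):
    print(name, theta.shape, g.shape)                  # each gradient matches its parameter
```

A quick sanity check is that every gradient has exactly the shape of the parameter it belongs to; a mismatch usually points to a missing transpose or a sum over the wrong axis.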
Part 4 — Optimization
Implement vanilla gradient-descent updates $\theta \leftarrow \theta - \eta\, g_\theta$ for all four parameters.
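The update rule can be sketched as a full-batch training loop on a toy problem. Everything below is an illustrative assumption rather than part of the question: the two-blob dataset, the tanh hidden layer, the learning rate of 0.2, the 300 steps, and the helper names `loss_and_grads` and `gd_step`.

```python
import numpy as np

def loss_and_grads(X, y, params):
    """Mean BCE from logits plus analytic gradients (tanh hidden layer)."""
    W1, b1, W2, b2 = params
    N = X.shape[0]
    y = y.reshape(-1, 1)
    a1 = np.tanh(X @ W1 + b1)
    z2 = a1 @ W2 + b2
    loss = np.mean(np.logaddexp(0.0, z2) - y * z2)
    p = 0.5 * (1.0 + np.tanh(0.5 * z2))        # overflow-free sigmoid
    dz2 = (p - y) / N
    dz1 = (dz2 @ W2.T) * (1.0 - a1 ** 2)
    return loss, [X.T @ dz1, dz1.sum(axis=0), a1.T @ dz2, dz2.sum(axis=0)]

def gd_step(params, grads, lr):
    """Vanilla gradient descent: theta <- theta - lr * grad, for every parameter."""
    return [theta - lr * g for theta, g in zip(params, grads)]

# Toy data: two Gaussian blobs, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

D, H = 2, 8
params = [rng.normal(size=(D, H)) / np.sqrt(D), np.zeros(H),
          rng.normal(size=(H, 1)) / np.sqrt(H), np.zeros(1)]

losses = []
for step in range(300):
    loss, grads = loss_and_grads(X, y, params)
    losses.append(loss)
    params = gd_step(params, grads, lr=0.2)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because the loss is a mean, the gradient magnitude does not grow with the number of examples, so the same learning rate remains meaningful if the batch size changes.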
Part 5 — Gradient checking
Verify your analytic gradients against central finite differences
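A minimal sketch of such a check, assuming a tanh hidden layer (smooth, so the finite difference is well behaved everywhere), a step of `eps = 1e-5` in float64, and a norm-based relative error. The helper names `loss_fn`, `analytic_grads`, `numeric_grads`, and `relative_error` are hypothetical. The Python loop runs over parameter entries, not over examples, so the forward pass itself stays vectorized.

```python
import numpy as np

def loss_fn(X, y, params):
    W1, b1, W2, b2 = params
    z2 = np.tanh(X @ W1 + b1) @ W2 + b2
    return np.mean(np.logaddexp(0.0, z2) - y.reshape(-1, 1) * z2)

def analytic_grads(X, y, params):
    W1, b1, W2, b2 = params
    a1 = np.tanh(X @ W1 + b1)
    z2 = a1 @ W2 + b2
    p = 0.5 * (1.0 + np.tanh(0.5 * z2))            # overflow-free sigmoid
    dz2 = (p - y.reshape(-1, 1)) / X.shape[0]
    dz1 = (dz2 @ W2.T) * (1.0 - a1 ** 2)
    return [X.T @ dz1, dz1.sum(axis=0), a1.T @ dz2, dz2.sum(axis=0)]

def numeric_grads(X, y, params, eps=1e-5):
    """Central differences: (L(theta + eps) - L(theta - eps)) / (2 * eps)."""
    grads = []
    for theta in params:
        g = np.zeros_like(theta)
        for idx in np.ndindex(theta.shape):
            old = theta[idx]
            theta[idx] = old + eps                 # perturb one entry in place
            loss_plus = loss_fn(X, y, params)
            theta[idx] = old - eps
            loss_minus = loss_fn(X, y, params)
            theta[idx] = old                       # restore before moving on
            g[idx] = (loss_plus - loss_minus) / (2.0 * eps)
        grads.append(g)
    return grads

def relative_error(a, b):
    return np.linalg.norm(a - b) / max(np.linalg.norm(a) + np.linalg.norm(b), 1e-12)

rng = np.random.default_rng(0)
N, D, H = 8, 3, 5
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=N).astype(np.float64)
params = [rng.normal(size=(D, H)) * 0.5, rng.normal(size=H) * 0.1,
          rng.normal(size=(H, 1)) * 0.5, rng.normal(size=1) * 0.1]

errors = [relative_error(ga, gn)
          for ga, gn in zip(analytic_grads(X, y, params), numeric_grads(X, y, params))]
for name, err in zip(("W1", "b1", "W2", "b2"), errors):
    print(f"{name}: relative error {err:.2e}")
```

With ReLU the same check can report a spurious mismatch when a pre-activation lies within `eps` of zero, because the perturbation crosses the kink; checking with tanh first, or using inputs that keep `z1` away from zero, is one way to separate that effect from a real bug.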
Part 6 — Discussion
Initialization: He vs Xavier, and bias initialization.
Activation choices: ReLU, tanh, sigmoid; pros and cons.
Batch size and gradient variance, and how it informs learning-rate scaling.
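Two of the points above can be illustrated empirically. The sketch below assumes He initialization means weight variance $2/\text{fan\_in}$ (paired here with ReLU) and Xavier/Glorot means $2/(\text{fan\_in} + \text{fan\_out})$ (paired here with tanh), with zero biases. It then estimates how the variance of the mini-batch gradient of $W_2$ changes with batch size $B$ when examples are sampled i.i.d. with replacement. The dataset, sizes, and the helper names `init_params` and `grad_W2` are hypothetical.

```python
import numpy as np

def init_params(D, H, activation, rng):
    """He init for ReLU, Xavier/Glorot init for tanh; biases start at zero."""
    if activation == "relu":
        std1 = np.sqrt(2.0 / D)                # He: variance 2 / fan_in
    else:
        std1 = np.sqrt(2.0 / (D + H))          # Xavier: variance 2 / (fan_in + fan_out)
    W1 = rng.normal(0.0, std1, size=(D, H))
    W2 = rng.normal(0.0, np.sqrt(2.0 / (H + 1)), size=(H, 1))
    return [W1, np.zeros(H), W2, np.zeros(1)]

def grad_W2(X, y, params, activation):
    """Gradient of the mean BCE with respect to W2 on one mini-batch."""
    W1, b1, W2, b2 = params
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0) if activation == "relu" else np.tanh(z1)
    z2 = a1 @ W2 + b2
    p = 0.5 * (1.0 + np.tanh(0.5 * z2))        # overflow-free sigmoid
    return a1.T @ (p - y.reshape(-1, 1)) / X.shape[0]

rng = np.random.default_rng(0)
N, D, H = 4096, 10, 16
X = rng.normal(size=(N, D))
y = (X[:, 0] + 0.5 * rng.normal(size=N) > 0).astype(np.float64)
params = init_params(D, H, "relu", rng)

variances = {}
for B in (8, 32, 128):
    samples = []
    for _ in range(300):
        idx = rng.integers(0, N, size=B)       # i.i.d. sampling with replacement
        samples.append(grad_W2(X[idx], y[idx], params, "relu").ravel())
    # Total variance: per-coordinate variance across batches, summed over W2.
    variances[B] = np.var(np.stack(samples), axis=0).sum()
    print(f"B={B:4d}  total gradient variance={variances[B]:.3e}")
```

Under i.i.d. sampling the variance of a mean over $B$ examples falls as $1/B$, so each fourfold increase in batch size should cut the measured variance by roughly four, up to estimation noise. A commonly cited heuristic is to raise the learning rate in proportion to the batch size; whether that holds for a given problem is something to verify by watching the training loss for divergence or oscillation.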
Follow-up Questions
Where exactly does the naive 1 - p lose precision, and which primitive (expm1 / log1p) recovers the tiny complement probability when p saturates to 1?
A hidden ReLU unit outputs 0 for every training example. What has happened, how would you detect it, and how would you mitigate it?
How does this stable-loss construction generalize from binary (softplus on one logit) to the multiclass softmax cross-entropy case?
If you double the batch size, how should you adjust the learning rate, and what would you monitor to know whether the adjustment was too aggressive?