PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/Machine Learning/OpenAI

Implement and Debug Backprop in NumPy

Last updated: Jun 21, 2026

Quick Overview

This question evaluates understanding of backpropagation and the chain rule, proficiency in vectorized NumPy implementation, numerical stability of softmax/cross-entropy, and competency in gradient checking and debugging of model gradients.

  • medium
  • OpenAI
  • Machine Learning
  • Software Engineer

Implement and Debug Backprop in NumPy

Company: OpenAI

Role: Software Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

Derive backpropagation for a two-layer neural network (Affine → ReLU → Affine → softmax). Write NumPy (Python 3. 7) code to compute the forward pass, loss, and analytical gradients without automatic differentiation using vectorized linear algebra. Implement gradient checking via finite differences and demonstrate how you would debug and fix a mismatch between analytical and numerical gradients. Clearly state tensor shapes at each layer, show the chain rule application, and explain any linear algebra identities used (e.g., broadcasting rules, transpose of products).

Quick Answer: This question evaluates understanding of backpropagation and the chain rule, proficiency in vectorized NumPy implementation, numerical stability of softmax/cross-entropy, and competency in gradient checking and debugging of model gradients.

Related Interview Questions

  • Implement 1NN with NumPy - OpenAI (medium)
  • Compute entropy and implement 1-NN - OpenAI (medium)
  • Defend a Research Direction and Experiment Design - OpenAI (medium)
  • Implement Backprop for a Tiny Network - OpenAI (hard)
  • Debug MiniGPT and Backpropagate Matmul - OpenAI (medium)
|Home/Machine Learning/OpenAI

Implement and Debug Backprop in NumPy

OpenAI logo
OpenAI
Sep 6, 2025, 12:00 AM
mediumSoftware EngineerTechnical ScreenMachine Learning
75
0

Two-Layer Neural Network: Backpropagation and Gradient Check (NumPy)

You are implementing a fully connected two-layer neural network for multi-class classification with the architecture:

Affine (XW1 + b1) → ReLU → Affine (·W2 + b2) → softmax → cross-entropy loss

Assume a mini-batch of N examples, input dimension D, hidden size H, and C classes.

TensorShapeMeaning
XN × Dmini-batch of inputs
y(N,)integer class labels in {0, 1, …, C−1}
W1, b1D × H, (H,)first affine layer
W2, b2H × C, (C,)second affine layer

This is a coding-and-debugging interview: it is as much about converting conceptual understanding into correct code and debugging skills as it is about the math. Across the parts below you will (1) derive the backprop equations, (2) implement the forward/loss/backward in NumPy, (3) build a gradient checker, and (4) walk through diagnosing a deliberate mismatch.

Constraints & Assumptions

  • Language/libraries: Python 3.7+, NumPy only. No automatic differentiation (no PyTorch/TensorFlow/JAX/ autograd ).
  • Vectorization: No explicit Python loops over the batch in the forward/backward path. (Loops are permitted inside the gradient checker.)
  • Loss: Mean (not sum) cross-entropy over the mini-batch, i.e. averaged by 1/N1/N1/N .
  • Numerical stability: Softmax must subtract the row-wise max from the logits before exponentiating.
  • Scale of the toy problem: Small enough to grad-check by brute force (e.g. N≤10N \le 10N≤10 , with D,H,CD, H, CD,H,C in the single/low-double digits) so finite differences run quickly.
  • Precision: float64 throughout; finite-difference step ε≈10−5\varepsilon \approx 10^{-5}ε≈10−5 , central differences.

Clarifying Questions to Ask

  • Is the loss the mean over the batch or the sum ? (This fixes whether a 1/N1/N1/N factor appears in the gradients.)
  • Should I include an L2 regularization term on the weights, or just the data loss?
  • Do you want the gradient w.r.t. the input X as well, or only the parameter gradients ( W1, b1, W2, b2 )?
  • What is the ReLU convention at exactly z=0z = 0z=0 — subgradient 0 or 1? (Does it matter for grad-checking?)
  • What tolerance counts as "passing" the gradient check, and on what dataset size should I demonstrate it?
  • Can I assume labels y are integer class indices (not one-hot)?

Part 1 — Derive the backpropagation equations

Derive ∂L/∂W1\partial L/\partial W_1∂L/∂W1​, ∂L/∂b1\partial L/\partial b_1∂L/∂b1​, ∂L/∂W2\partial L/\partial W_2∂L/∂W2​, ∂L/∂b2\partial L/\partial b_2∂L/∂b2​.

  • Apply the chain rule explicitly, layer by layer.
  • State the tensor shape at each step .
  • Explain the linear-algebra identities you use — e.g. (AB)^T = B^T A^T , the broadcasting rule for the biases, and how the affine gradient ∂L/∂W = X^T · upstream arises.

What This Part Should Cover

  • Chain-rule fluency — clean layer-by-layer decomposition; the fused softmax+CE result derived (not just quoted), with the 1/N1/N1/N factor placed correctly.
  • Shape discipline — a stated shape at every intermediate, and grad.shape == param.shape treated as the invariant that pins down each transpose.
  • Identity justification — (AB)^T = B^T A^T , the bias-as-sum-over-rows argument, and why ∂L/∂W=XTG\partial L/\partial W = X^T G∂L/∂W=XTG rather than G XTG\,X^TGXT .

Part 2 — Implement the NumPy code

Implement the forward pass, the softmax cross-entropy loss, and the analytical gradients in NumPy (Python 3.7+).

  • Use vectorized linear algebra; no automatic differentiation and no explicit Python loops over the batch.
  • Use a numerically stable softmax (subtract the row-wise max before exponentiating).
  • Include in-code comments indicating shapes and broadcasting at each layer and each gradient.

What This Part Should Cover

  • Vectorized, autodiff-free code that runs as written with only NumPy, with no batch loops on the forward/backward path.
  • Numerical stability — the row-max-subtracting softmax, and an argument for why it is exact.
  • The fused backward — dScores formed in one shot without materializing a per-example Jacobian, with the single 1/N1/N1/N applied once.
  • Self-documenting shapes — inline comments asserting the shape and the broadcast at every layer and gradient.

Part 3 — Implement gradient checking

Verify your analytical gradients against central finite differences on a small synthetic (toy) dataset, and report the relative error.

What This Part Should Cover

  • A correct two-sided estimator with strict one-entry-at-a-time perturbation and guaranteed parameter restoration.
  • A sound relative-error metric ( ∣a−b∣/(∣a∣+∣b∣)|a-b|/(|a|+|b|)∣a−b∣/(∣a∣+∣b∣) , not floored at 1.0) plus a stated pass/fail threshold.
  • A demonstrated run on a toy dataset with a defensible ε\varepsilonε and a loss sanity-check against log⁡C\log ClogC .

Part 4 — Demonstrate debugging

Show how you would diagnose and fix a mismatch between the analytical and numerical gradients (for example, a missing 1/N averaging factor or an incorrect ReLU mask). Explain the symptom, how you localize it, and the fix.

What This Part Should Cover

  • Symptom → localization → fix for at least two distinct bugs (e.g. the 1/N1/N1/N factor and the ReLU mask), each with its tell-tale signature.
  • Reading the signature — uniform-factor-across-all-params vs. layer-localized vs. shape-crash vs. nan , and what each implies.
  • Active isolation — toggling NNN , forcing dead ReLU units, and shape assertions used as deliberate probes, not guesses.

What a Strong Answer Covers

These signals span all four parts:

  • Shape discipline as an invariant carried from the derivation into the code into the debugging.
  • Correct, single placement of the 1/N1/N1/N factor , and the ability to detect a violation by its symptom.
  • Communication — narrating conventions and trade-offs out loud (e.g. masking the pre-activation vs. the activation, why central differences over one-sided).

Follow-up Questions

  • How does the cost of finite-difference gradient checking scale with the number of parameters, and why is it used only as a test rather than for training?
  • ReLU is non-differentiable at z=0z = 0z=0 . Why does the gradient check still pass in practice, and when could a unit sitting exactly on the kink actually cause a mismatch?
  • If you added L2 regularization to the loss, what exactly changes in dW1 and dW2 , and what is the classic self-inflicted grad-check failure that results from forgetting it?
  • How would this derivation and code generalize to an LLL-layer network, and which quantity that you computed here (but didn't strictly need) becomes essential?
Loading comments...

Browse More Questions

More Machine Learning•More OpenAI•More Software Engineer•OpenAI Software Engineer•OpenAI Machine Learning•Software Engineer Machine Learning

Write your answer

Your first approved answer each day earns 20 XP.

Sign in to write your answer.
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.