PracHub

Implement and derive backprop from scratch

Last updated: Jun 21, 2026

Quick Overview

This question evaluates understanding and practical implementation of neural network fundamentals, specifically analytic backpropagation and gradient derivation, numerically stable binary cross-entropy computation, parameter initialization, gradient descent updates, and gradient checking.


Company: Anthropic

Role: Software Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Onsite

From first principles, implement a tiny neural network for binary classification with one hidden layer. Write the forward pass, a cross-entropy loss, and the complete backward pass (no autograd). Derive analytic gradients for all parameters, implement gradient descent, and verify correctness with finite-difference gradient checking. Discuss numerical stability (e.g., log-sum-exp), initialization, activation choices, and how batch size affects variance.


Sep 6, 2025, 12:00 AM

Tiny Neural Network From First Principles: Binary Classification

Implement and analyze a minimal neural network for binary classification with a single hidden layer, using vectorized NumPy (or a similar array library) without autograd — every gradient must be derived and coded by hand.

Assume a dataset with features $X \in \mathbb{R}^{N \times D}$ and labels $y \in \{0,1\}^N$. The network is:

  • Hidden layer: $H$ units with an activation $f$ (ReLU or tanh).
  • Output layer: a single unit with a sigmoid producing $P(y=1 \mid x)$.

The deliverable is a complete, self-contained training pipeline (forward, loss, backward, optimization, gradient check) plus a short written discussion of the numerical and design choices. The question is split into six parts below.

Constraints & Assumptions

  • Parameter shapes are fixed: $W_1 \in \mathbb{R}^{D \times H}$, $b_1 \in \mathbb{R}^{H}$, $W_2 \in \mathbb{R}^{H \times 1}$, $b_2 \in \mathbb{R}^{1}$.
  • Computation is vectorized over the batch: no Python loops over the $N$ examples in the forward/backward path.
  • No autograd / no deep-learning framework gradients — analytic derivatives only. Finite differences are allowed solely for the gradient-check verification step (Part 5).
  • Work in float64 for the implementation and especially for gradient checking.
  • The loss is the mean (not sum) binary cross-entropy over the batch.

Clarifying Questions to Ask

  • Should the loss be the mean or sum over the batch? (This decision changes the gradient scale and how the learning rate is interpreted.)
  • Which hidden activation is in scope: ReLU, tanh, or both? (It changes the derivative term and the recommended initialization.)
  • Is mini-batch SGD expected, or is full-batch / single-batch gradient descent sufficient for the deliverable?
  • What floating-point precision should the reference target, and how tight should the gradient-check tolerance be?
  • Should the numerically stable loss be expressed in terms of the probability $p$ or the logit $z_2$? (This is the crux of the stability design.)
  • Is a runnable end-to-end demo on a toy dataset expected, or just the component functions?

Part 1 — Forward pass

Compute, in vectorized form:

$$z_1 = X W_1 + b_1, \qquad a_1 = f(z_1), \qquad z_2 = a_1 W_2 + b_2, \qquad p = \sigma(z_2),$$

where $\sigma$ is the logistic sigmoid. State the shape of each intermediate and confirm the biases broadcast correctly across the $N$ rows.
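The forward pass above can be sketched as follows. This is a minimal illustration rather than a reference solution: the function names `sigmoid` and `forward`, the `activation` argument, and the toy sizes are assumptions introduced here. The shape comments show how `b1` with shape `(H,)` and `b2` with shape `(1,)` broadcast across the $N$ rows.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid computed piecewise so exp() never overflows."""
    out = np.empty_like(z, dtype=np.float64)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # safe: exp of a non-positive number
    ez = np.exp(z[~pos])                       # safe: exp of a negative number
    out[~pos] = ez / (1.0 + ez)
    return out

def forward(X, W1, b1, W2, b2, activation="tanh"):
    """Return all intermediates so a backward pass can reuse them."""
    z1 = X @ W1 + b1            # (N, D) @ (D, H) + (H,)  -> (N, H)
    if activation == "relu":
        a1 = np.maximum(z1, 0.0)
    else:
        a1 = np.tanh(z1)        # (N, H)
    z2 = a1 @ W2 + b2           # (N, H) @ (H, 1) + (1,)  -> (N, 1)
    p = sigmoid(z2)             # (N, 1), each entry in (0, 1)
    return z1, a1, z2, p

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
N, D, H = 5, 3, 4
X = rng.normal(size=(N, D))
W1 = rng.normal(size=(D, H)) * 0.1
b1 = np.zeros(H)
W2 = rng.normal(size=(H, 1)) * 0.1
b2 = np.zeros(1)
z1, a1, z2, p = forward(X, W1, b1, W2, b2)
print(z1.shape, a1.shape, z2.shape, p.shape)
```

Returning `z1`, `a1`, and `z2` alongside `p` is one reasonable design: the backward pass needs exactly these cached values.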


Part 2 — Loss (numerically stable binary cross-entropy)

Implement mean binary cross-entropy. The naive form $-[y\log p + (1-y)\log(1-p)]$ blows up once $p$ rounds to exactly $0$ or $1$. Use a stable formulation (e.g. softplus $\log(1+e^{x})$ or log-sum-exp) so the loss never produces $\pm\infty$ or NaN.
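One way to get a stable loss is to work from the logit instead of the probability. Since $\log \sigma(z) = -\operatorname{softplus}(-z)$ and $\log(1-\sigma(z)) = -\operatorname{softplus}(z)$, the per-example loss simplifies to $\operatorname{softplus}(z) - y\,z$. The sketch below assumes that formulation; `softplus` and `bce_with_logits` are names chosen here for illustration.

```python
import numpy as np

def softplus(z):
    """log(1 + exp(z)) written so that exp() only sees non-positive inputs."""
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

def bce_with_logits(z2, y):
    """Mean binary cross-entropy computed from logits: softplus(z) - y * z."""
    z = np.asarray(z2, dtype=np.float64).reshape(-1)
    y = np.asarray(y, dtype=np.float64).reshape(-1)
    return np.mean(softplus(z) - y * z)

# Extreme logits that would give log(0) in the naive formula.
z = np.array([-800.0, -2.0, 0.0, 2.0, 800.0])
y = np.array([1, 0, 1, 0, 0])
loss = bce_with_logits(z, y)
print(np.isfinite(loss))
```

`np.logaddexp(0.0, z)` computes the same softplus and could replace the hand-written helper.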


Part 3 — Backward pass (analytic gradients, no autograd)

Derive and implement $\partial L/\partial W_1$, $\partial L/\partial b_1$, $\partial L/\partial W_2$, $\partial L/\partial b_2$ by the chain rule. State the derivative of your hidden activation explicitly.
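A sketch of the chain-rule result, assuming the mean loss from the constraints. With sigmoid plus cross-entropy, the logit gradient collapses to $\partial L/\partial z_2 = (p - y)/N$; the rest follows layer by layer. The activation derivatives used are $f'(z_1) = 1 - a_1^2$ for tanh and $f'(z_1) = \mathbb{1}[z_1 > 0]$ for ReLU. The function name `backward` and its argument order are illustrative choices, not part of the question.

```python
import numpy as np

def backward(X, y, W2, z1, a1, p, activation="tanh"):
    """Analytic gradients of the mean BCE for the one-hidden-layer network."""
    N = X.shape[0]
    y = y.reshape(-1, 1)
    dz2 = (p - y) / N                  # (N, 1): dL/dz2, the 1/N comes from the mean
    dW2 = a1.T @ dz2                   # (H, 1)
    db2 = dz2.sum(axis=0)              # (1,)
    da1 = dz2 @ W2.T                   # (N, H)
    if activation == "relu":
        dz1 = da1 * (z1 > 0)           # ReLU derivative: 1 where z1 > 0, else 0
    else:
        dz1 = da1 * (1.0 - a1 ** 2)    # tanh derivative: 1 - tanh(z1)^2
    dW1 = X.T @ dz1                    # (D, H)
    db1 = dz1.sum(axis=0)              # (H,)
    return dW1, db1, dW2, db2

# Small demo with a tanh hidden layer; sizes are arbitrary.
rng = np.random.default_rng(1)
N, D, H = 6, 3, 4
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=N)
W1, b1 = rng.normal(size=(D, H)) * 0.5, np.zeros(H)
W2, b2 = rng.normal(size=(H, 1)) * 0.5, np.zeros(1)
z1 = X @ W1 + b1
a1 = np.tanh(z1)
z2 = a1 @ W2 + b2
p = 1.0 / (1.0 + np.exp(-z2))          # fine here because the logits are small
dW1, db1, dW2, db2 = backward(X, y, W2, z1, a1, p)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)
```

Each gradient has the same shape as the parameter it belongs to, which is a cheap first sanity check before running a full gradient check.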


Part 4 — Optimization

Implement vanilla gradient-descent updates $\theta \leftarrow \theta - \eta\, g_\theta$ for all four parameters.
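The update rule can be exercised end to end on a toy problem. The sketch below is one possible full-batch loop under stated assumptions: a tanh hidden layer, a synthetic linearly separable dataset, and a learning rate and step count picked arbitrarily for illustration. `loss_and_grads` and `gd_step` are hypothetical helper names.

```python
import numpy as np

def loss_and_grads(params, X, y):
    """Mean BCE and hand-derived gradients for a tanh hidden layer."""
    W1, b1, W2, b2 = params
    N = X.shape[0]
    a1 = np.tanh(X @ W1 + b1)                        # (N, H)
    z2 = (a1 @ W2 + b2).reshape(-1)                  # (N,)
    loss = np.mean(np.logaddexp(0.0, z2) - y * z2)   # stable softplus form
    p = 0.5 * (1.0 + np.tanh(0.5 * z2))              # sigmoid via tanh, no overflow
    dz2 = ((p - y) / N).reshape(-1, 1)
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (1.0 - a1 ** 2)
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    return loss, [dW1, db1, dW2, db2]

def gd_step(params, grads, lr):
    """Vanilla update theta <- theta - lr * g, applied in place."""
    for theta, g in zip(params, grads):
        theta -= lr * g

# Synthetic data: label is 1 when x0 + x1 > 0 (an assumption for the demo).
rng = np.random.default_rng(0)
N, D, H = 200, 2, 8
X = rng.normal(size=(N, D))
y = (X[:, 0] + X[:, 1] > 0).astype(np.float64)
params = [
    rng.normal(size=(D, H)) / np.sqrt(D),
    np.zeros(H),
    rng.normal(size=(H, 1)) / np.sqrt(H),
    np.zeros(1),
]
losses = []
for step in range(500):
    loss, grads = loss_and_grads(params, X, y)
    losses.append(loss)
    gd_step(params, grads, lr=0.2)
print(losses[0], losses[-1])
```

Logging the loss before each update makes it easy to spot a learning rate that is too large: the curve oscillates or grows instead of falling.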


Part 5 — Gradient checking

Verify your analytic gradients against central finite differences

$$g_{\text{num}} \approx \frac{L(\theta + \varepsilon) - L(\theta - \varepsilon)}{2\varepsilon},$$

and report the relative error per parameter.
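A gradient check along these lines might look like the sketch below. It assumes a tanh hidden layer (smooth, so finite differences behave well), float64 throughout, $\varepsilon = 10^{-5}$, and a relative error defined as $\lVert g_a - g_n \rVert / (\lVert g_a \rVert + \lVert g_n \rVert)$. All of these are illustrative choices, as are the helper names.

```python
import numpy as np

def loss_fn(params, X, y):
    W1, b1, W2, b2 = params
    z2 = (np.tanh(X @ W1 + b1) @ W2 + b2).reshape(-1)
    return np.mean(np.logaddexp(0.0, z2) - y * z2)

def analytic_grads(params, X, y):
    W1, b1, W2, b2 = params
    N = X.shape[0]
    a1 = np.tanh(X @ W1 + b1)
    z2 = (a1 @ W2 + b2).reshape(-1)
    p = 0.5 * (1.0 + np.tanh(0.5 * z2))
    dz2 = ((p - y) / N).reshape(-1, 1)
    dz1 = (dz2 @ W2.T) * (1.0 - a1 ** 2)
    return [X.T @ dz1, dz1.sum(axis=0), a1.T @ dz2, dz2.sum(axis=0)]

def numeric_grad(params, X, y, idx, eps=1e-5):
    """Central differences, perturbing one scalar entry of params[idx] at a time."""
    theta = params[idx]
    g = np.zeros_like(theta)
    for i in np.ndindex(theta.shape):
        old = theta[i]
        theta[i] = old + eps
        loss_plus = loss_fn(params, X, y)
        theta[i] = old - eps
        loss_minus = loss_fn(params, X, y)
        theta[i] = old                      # restore the original value
        g[i] = (loss_plus - loss_minus) / (2.0 * eps)
    return g

def relative_error(ga, gn):
    return np.linalg.norm(ga - gn) / max(np.linalg.norm(ga) + np.linalg.norm(gn), 1e-12)

rng = np.random.default_rng(0)
N, D, H = 10, 3, 4
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=N).astype(np.float64)
params = [rng.normal(size=(D, H)), rng.normal(size=H),
          rng.normal(size=(H, 1)), rng.normal(size=1)]
ga = analytic_grads(params, X, y)
errors = {}
for idx, name in enumerate(["W1", "b1", "W2", "b2"]):
    errors[name] = relative_error(ga[idx], numeric_grad(params, X, y, idx))
    print(name, errors[name])
```

The Python loop here runs over parameter entries, not over examples, so it stays within the constraint that only the forward/backward path must be vectorized. With ReLU the same check can report spurious errors when a perturbation crosses the kink at zero.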


Part 6 — Discussion

Write short notes on each:

  • Numerical stability: sigmoid/logistic loss, softplus / log-sum-exp, log1p, expm1, clipping.
  • Initialization: He vs Xavier, and bias initialization.
  • Activation choices: ReLU, tanh, sigmoid; pros and cons.
  • Batch size and gradient variance: and how it informs learning-rate scaling.
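For the initialization note, a short sketch can make the He vs Xavier distinction concrete. It assumes the common normal-distribution variants: He uses variance $2/\text{fan\_in}$ (usually paired with ReLU) and Xavier/Glorot uses variance $2/(\text{fan\_in} + \text{fan\_out})$ (usually paired with tanh or sigmoid), with biases starting at zero. `init_params` is a hypothetical helper, and using Xavier for the sigmoid output layer is one reasonable choice among several.

```python
import numpy as np

def init_params(D, H, activation="relu", seed=0):
    """Return (W1, b1, W2, b2) with activation-appropriate weight scales."""
    rng = np.random.default_rng(seed)
    if activation == "relu":
        std1 = np.sqrt(2.0 / D)            # He: variance 2 / fan_in
    else:
        std1 = np.sqrt(2.0 / (D + H))      # Xavier: variance 2 / (fan_in + fan_out)
    W1 = rng.normal(0.0, std1, size=(D, H))
    b1 = np.zeros(H)                       # zero biases are a common default
    std2 = np.sqrt(2.0 / (H + 1))          # Xavier for the sigmoid output unit
    W2 = rng.normal(0.0, std2, size=(H, 1))
    b2 = np.zeros(1)
    return W1, b1, W2, b2

W1_he, _, _, _ = init_params(256, 128, activation="relu")
W1_xa, _, _, _ = init_params(256, 128, activation="tanh")
print(W1_he.std(), W1_xa.std())
```

The printed standard deviations should sit close to the target scales, with He noticeably larger than Xavier for the same layer shape.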


Follow-up Questions

  • Where exactly does the naive 1 - p lose precision, and which primitive (expm1 / log1p) recovers the tiny complement probability when $p$ saturates to $1$?
  • A hidden ReLU unit outputs $0$ for every training example. What has happened, how would you detect it, and how would you mitigate it?
  • How does this stable-loss construction generalize from binary (softplus on one logit) to the multiclass softmax cross-entropy case?
  • If you double the batch size, how should you adjust the learning rate, and what would you monitor to know whether the adjustment was too aggressive?
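The first follow-up can be illustrated numerically. In the sketch below, a logit of 40 (an arbitrary value chosen for the demo) makes the sigmoid round to exactly 1.0 in float64, so the naive complement becomes 0. Working from $\log p$ with `log1p` and then applying `expm1` recovers the tiny value.

```python
import math

z = 40.0                                   # illustrative large positive logit

# Naive route: exp(-40) is far below float64 spacing near 1, so p rounds to 1.0.
p = 1.0 / (1.0 + math.exp(-z))
naive_complement = 1.0 - p                 # all information about 1 - p is lost

# Stable route: keep log p, then use expm1 to form 1 - exp(log p).
log_p = -math.log1p(math.exp(-z))          # log sigmoid(z), a tiny negative number
complement = -math.expm1(log_p)            # 1 - p without cancellation

# log(1 - p) directly from the logit: -softplus(z) = -(z + log1p(exp(-z))) for z > 0.
log_complement = -(z + math.log1p(math.exp(-z)))

print(naive_complement, complement, log_complement)
```

The loss of precision happens at the addition `1.0 + exp(-z)`, before the subtraction is ever performed, which is why the fix has to avoid forming $p$ at all.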