PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/Machine Learning/OpenAI

Derive Backpropagation for Matrix-Product Layers

Last updated: Jun 17, 2026

Quick Overview

This question evaluates understanding of backpropagation through linear blocks, matrix calculus, and the computation of gradients with respect to individual weight matrices in a product of trainable matrices, including reasoning about tensor shapes and matrix differentials.

  • hard
  • OpenAI
  • Machine Learning
  • Machine Learning Engineer

Derive Backpropagation for Matrix-Product Layers

Company: OpenAI

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

Consider a neural-network block whose output is produced by multiplying a *sequence* of trainable weight matrices together, then applying the resulting composite matrix to an input. There is no nonlinearity *inside* this block — it is a pure chain of matrix multiplications. Let the trainable matrices be $W_1, W_2, \ldots, W_n$. Their cumulative product defines a single effective linear map $$ C = W_1 W_2 \cdots W_n . $$ Given an input vector or mini-batch $X$, the forward pass is $$ Z = C\,X = W_1 W_2 \cdots W_n\, X . $$ There is a scalar loss $\mathcal{L}$, and the upstream gradient $$ G = \frac{\partial \mathcal{L}}{\partial Z} $$ is supplied by the loss or by later layers (so $G$ has the same shape as $Z$). Derive the **backward pass** for this block. Specifically: 1. Express the gradient $\dfrac{\partial \mathcal{L}}{\partial W_j}$ with respect to **each** individual matrix $W_j$, for every $1 \le j \le n$. 2. Show explicitly how the multivariate chain rule (or the matrix differential) applies to the matrix product — i.e. *why* the answer has the form it does, not just the final formula. 3. Verify that the resulting gradient $\dfrac{\partial \mathcal{L}}{\partial W_j}$ has the same shape as $W_j$. 4. Describe an efficient implementation that avoids recomputing the same prefix and suffix matrix products for every $j$. ```hint Isolate one matrix $W_j$ sits in the *middle* of the product, which makes the chain rule awkward. Try freezing every other matrix and viewing $Z$ as the composition of three pieces: a left **prefix** built from the matrices before $W_j$, the target $W_j$ itself, and a right **suffix** built from the matrices after it (the suffix can absorb $X$). Relative to $W_j$ alone, what kind of map is this — and what does that buy you? ``` ```hint Differentiate the isolated form Once $W_j$ is isolated, perturb only $W_j$ and ask how $Z$ changes to first order. For a *scalar* loss, the contraction of the upstream gradient against a matrix change is naturally written with a trace: $d\mathcal{L} = \operatorname{tr}(G^\top\, dZ)$. The trace's invariance under cyclic shifts is the tool that lets you collect everything multiplying $dW_j$ into one factor. Be careful about which side each frozen piece lands on and where transposes appear. ``` ```hint Avoid the $O(n^2)$ trap The per-matrix formula re-uses overlapping prefix/suffix products, so recomputing them for each $j$ is quadratic. Ask: do you ever need the prefix as a full *matrix*, or only its action on the upstream gradient? And can the suffix pieces all be obtained from a single pass in one direction? Two coordinated sweeps in opposite directions should be enough to produce every gradient in linear time. ``` ### Constraints & Assumptions - Shapes: let $W_j \in \mathbb{R}^{d_{j-1} \times d_j}$, so the composite $C \in \mathbb{R}^{d_0 \times d_n}$. For a mini-batch, $X \in \mathbb{R}^{d_n \times B}$ and $Z, G \in \mathbb{R}^{d_0 \times B}$; for a single example take $B = 1$. - The matrices need **not** be square; adjacent inner dimensions just have to match so the product is defined. - There is no nonlinearity or bias term inside this block — it is a pure chain of matrix multiplications. Any activation lives in the layers that produce $G$. - "Efficient" means linear in the number of matrices $n$: the whole backward pass should cost $O(n)$ matrix products, not $O(n^2)$, and should avoid materializing large $d_0 \times d_{j-1}$ prefix matrices when $B$ is small. ### Clarifying Questions to Ask - Is $X$ a single vector or a mini-batch, and is the loss summed or **averaged** over the batch (which scales every $dW_j$)? - Should I treat $X$ as fixed input, or is the gradient $\partial\mathcal{L}/\partial X$ also required? - Are the matrices guaranteed non-singular / square, or fully general rectangular? - Is there any nonlinearity or bias term between the matrices, or is the block a pure linear chain as stated? - What's the target metric — closed-form gradient formula, an $O(n)$ algorithm, or working code/pseudocode? - Do we ever need to materialize the composite $C$ at inference, or only its action $CX$? ### What a Strong Answer Covers - Cleanly isolates a single $W_j$ by grouping the rest of the product into a left prefix and a right suffix, and states the empty-product-is-identity convention. - A genuine **derivation** (matrix differential + trace cyclic property, or an equivalent index/chain-rule argument) that *produces* the per-matrix gradient rather than asserting it, with transposes and left/right placement justified. - A closed-form expression for $\partial\mathcal{L}/\partial W_j$ in terms of the prefix, the upstream gradient $G$, and the suffix. - An explicit dimension check confirming the gradient has the same shape as $W_j$ (i.e. $d_{j-1}\times d_j$), with the batch dimension correctly contracted away. - A linear-time ($O(n)$) algorithm that computes all $n$ gradients without recomputing overlapping products or materializing the large prefix matrices — i.e. coordinated sweeps in both directions, with correct transpose placement. - Awareness of practical concerns: batch averaging applied exactly once, and the link to standard reverse-mode autodiff over a linear chain. ### Follow-up Questions - How does the cost and the formula change if you also need $\partial\mathcal{L}/\partial X$? - Numerically, why is composing the matrices into a single $C$ and backpropping through $C$ alone *not* equivalent — what information about the individual $W_j$ would you lose? - If $n$ is large, what conditioning / vanishing-or-exploding-gradient issues arise from the repeated products, and how would you mitigate them? - Suppose two of the matrices are tied (the *same* parameter reused at positions $j$ and $k$). How does the gradient w.r.t. that shared parameter change?

Quick Answer: This question evaluates understanding of backpropagation through linear blocks, matrix calculus, and the computation of gradients with respect to individual weight matrices in a product of trainable matrices, including reasoning about tensor shapes and matrix differentials.

Related Interview Questions

  • Implement 1NN with NumPy - OpenAI (medium)
  • Compute entropy and implement 1-NN - OpenAI (medium)
  • Defend a Research Direction and Experiment Design - OpenAI (medium)
  • Implement Backprop for a Tiny Network - OpenAI (hard)
  • Debug MiniGPT and Backpropagate Matmul - OpenAI (medium)
|Home/Machine Learning/OpenAI

Derive Backpropagation for Matrix-Product Layers

OpenAI logo
OpenAI
Feb 5, 2026, 12:00 AM
hardMachine Learning EngineerTechnical ScreenMachine Learning
33
0

Consider a neural-network block whose output is produced by multiplying a sequence of trainable weight matrices together, then applying the resulting composite matrix to an input. There is no nonlinearity inside this block — it is a pure chain of matrix multiplications.

Let the trainable matrices be W1,W2,…,WnW_1, W_2, \ldots, W_nW1​,W2​,…,Wn​. Their cumulative product defines a single effective linear map

C=W1W2⋯Wn.C = W_1 W_2 \cdots W_n .C=W1​W2​⋯Wn​.

Given an input vector or mini-batch XXX, the forward pass is

Z=C X=W1W2⋯Wn X.Z = C\,X = W_1 W_2 \cdots W_n\, X .Z=CX=W1​W2​⋯Wn​X.

There is a scalar loss L\mathcal{L}L, and the upstream gradient

G=∂L∂ZG = \frac{\partial \mathcal{L}}{\partial Z}G=∂Z∂L​

is supplied by the loss or by later layers (so GGG has the same shape as ZZZ).

Derive the backward pass for this block. Specifically:

  1. Express the gradient ∂L∂Wj\dfrac{\partial \mathcal{L}}{\partial W_j}∂Wj​∂L​ with respect to each individual matrix WjW_jWj​ , for every 1≤j≤n1 \le j \le n1≤j≤n .
  2. Show explicitly how the multivariate chain rule (or the matrix differential) applies to the matrix product — i.e. why the answer has the form it does, not just the final formula.
  3. Verify that the resulting gradient ∂L∂Wj\dfrac{\partial \mathcal{L}}{\partial W_j}∂Wj​∂L​ has the same shape as WjW_jWj​ .
  4. Describe an efficient implementation that avoids recomputing the same prefix and suffix matrix products for every jjj .

Constraints & Assumptions

  • Shapes: let Wj∈Rdj−1×djW_j \in \mathbb{R}^{d_{j-1} \times d_j}Wj​∈Rdj−1​×dj​ , so the composite C∈Rd0×dnC \in \mathbb{R}^{d_0 \times d_n}C∈Rd0​×dn​ . For a mini-batch, X∈Rdn×BX \in \mathbb{R}^{d_n \times B}X∈Rdn​×B and Z,G∈Rd0×BZ, G \in \mathbb{R}^{d_0 \times B}Z,G∈Rd0​×B ; for a single example take B=1B = 1B=1 .
  • The matrices need not be square; adjacent inner dimensions just have to match so the product is defined.
  • There is no nonlinearity or bias term inside this block — it is a pure chain of matrix multiplications. Any activation lives in the layers that produce GGG .
  • "Efficient" means linear in the number of matrices nnn : the whole backward pass should cost O(n)O(n)O(n) matrix products, not O(n2)O(n^2)O(n2) , and should avoid materializing large d0×dj−1d_0 \times d_{j-1}d0​×dj−1​ prefix matrices when BBB is small.

Clarifying Questions to Ask

  • Is XXX a single vector or a mini-batch, and is the loss summed or averaged over the batch (which scales every dWjdW_jdWj​ )?
  • Should I treat XXX as fixed input, or is the gradient ∂L/∂X\partial\mathcal{L}/\partial X∂L/∂X also required?
  • Are the matrices guaranteed non-singular / square, or fully general rectangular?
  • Is there any nonlinearity or bias term between the matrices, or is the block a pure linear chain as stated?
  • What's the target metric — closed-form gradient formula, an O(n)O(n)O(n) algorithm, or working code/pseudocode?
  • Do we ever need to materialize the composite CCC at inference, or only its action CXCXCX ?

What a Strong Answer Covers

  • Cleanly isolates a single WjW_jWj​ by grouping the rest of the product into a left prefix and a right suffix, and states the empty-product-is-identity convention.
  • A genuine derivation (matrix differential + trace cyclic property, or an equivalent index/chain-rule argument) that produces the per-matrix gradient rather than asserting it, with transposes and left/right placement justified.
  • A closed-form expression for ∂L/∂Wj\partial\mathcal{L}/\partial W_j∂L/∂Wj​ in terms of the prefix, the upstream gradient GGG , and the suffix.
  • An explicit dimension check confirming the gradient has the same shape as WjW_jWj​ (i.e. dj−1×djd_{j-1}\times d_jdj−1​×dj​ ), with the batch dimension correctly contracted away.
  • A linear-time ( O(n)O(n)O(n) ) algorithm that computes all nnn gradients without recomputing overlapping products or materializing the large prefix matrices — i.e. coordinated sweeps in both directions, with correct transpose placement.
  • Awareness of practical concerns: batch averaging applied exactly once, and the link to standard reverse-mode autodiff over a linear chain.

Follow-up Questions

  • How does the cost and the formula change if you also need ∂L/∂X\partial\mathcal{L}/\partial X∂L/∂X ?
  • Numerically, why is composing the matrices into a single CCC and backpropping through CCC alone not equivalent — what information about the individual WjW_jWj​ would you lose?
  • If nnn is large, what conditioning / vanishing-or-exploding-gradient issues arise from the repeated products, and how would you mitigate them?
  • Suppose two of the matrices are tied (the same parameter reused at positions jjj and kkk ). How does the gradient w.r.t. that shared parameter change?
Loading comments...

Browse More Questions

More Machine Learning•More OpenAI•More Machine Learning Engineer•OpenAI Machine Learning Engineer•OpenAI Machine Learning•Machine Learning Engineer Machine Learning

Write your answer

Your first approved answer each day earns 20 XP.

Sign in to write your answer.
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.