Implement scaled dot-product attention
Company: Meta
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
## Task
In this interview you are asked to **hand-write the forward pass of attention** from its mathematical definition (the code does not need to be run).
Implement **single-head scaled dot-product attention**.
### Inputs
- `Q`: a 2D array of shape `(Tq, d)` (queries)
- `K`: a 2D array of shape `(Tk, d)` (keys)
- `V`: a 2D array of shape `(Tk, dv)` (values)
- Optional `mask`: a 2D array of shape `(Tq, Tk)`, where `mask[i][j] = 0/False` means query `i` is **not allowed** to attend to position `j`.
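As a concrete example of this mask convention, a causal (decoder-style) mask for the case `Tq == Tk == T` could look like the sketch below; the variable names are illustrative:

```python
import numpy as np

T = 4
# Lower-triangular mask: query i may attend only to positions j <= i.
causal_mask = np.tril(np.ones((T, T), dtype=bool))
# causal_mask[0] == [ True, False, False, False]
# causal_mask[3] == [ True,  True,  True,  True]
```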
### Output
- `O`: a 2D array of shape `(Tq, dv)` defined by:
1) Compute scores:
\[
S = \frac{QK^\top}{\sqrt{d}} \quad \text{(shape } T_q \times T_k\text{)}
\]
2) Apply mask (if provided): for any masked-out entry `S[i][j]`, treat it as \(-\infty\) before softmax.
3) Apply row-wise softmax to get attention weights:
\[
A_{i,:} = \mathrm{softmax}(S_{i,:})
\]
4) Compute the output:
\[
O = AV
\]
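Putting the four steps together, here is a minimal NumPy sketch of one possible answer; the function name and the `-np.inf` masking convention are illustrative choices, not a prescribed API:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Single-head scaled dot-product attention.

    Q: (Tq, d), K: (Tk, d), V: (Tk, dv); optional boolean/0-1
    mask of shape (Tq, Tk). Returns O of shape (Tq, dv).
    """
    d = Q.shape[-1]

    # 1) Scores: scaling by sqrt(d) keeps dot-product magnitudes from
    #    growing with dimension, which would saturate the softmax.
    S = (Q @ K.T) / np.sqrt(d)

    # 2) Masked-out positions become -inf, so softmax assigns them
    #    exactly zero weight (exp(-inf) == 0).
    if mask is not None:
        S = np.where(mask, S, -np.inf)

    # 3) Numerically stable row-wise softmax: subtracting the row max
    #    prevents overflow in exp; the shift cancels in the ratio.
    #    Caveat: a fully masked row yields NaN here; see the edge-case
    #    discussion below for one way to handle it.
    row_max = S.max(axis=-1, keepdims=True)
    expS = np.exp(S - row_max)
    A = expS / expS.sum(axis=-1, keepdims=True)

    # 4) Weighted sum of values: (Tq, Tk) @ (Tk, dv) -> (Tq, dv).
    return A @ V
```

A common design variation is to use a large negative finite constant (e.g. `-1e9`) instead of `-np.inf` at masked positions; that sidesteps `inf - inf` NaNs at the cost of not zeroing the masked weights exactly.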
### Requirements / Discussion points
- Write clear pseudocode or Python-like code for the above.
- Explain how you would implement **numerically stable softmax** (see the sketch after this list).
- State the time complexity in terms of `Tq`, `Tk`, `d`, and `dv`.
- Mention at least two edge cases (e.g., a fully-masked row, large score magnitudes, shape mismatches); the sketch below treats the first two.
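For the stable-softmax and fully-masked-row points, one possible treatment is sketched below; the helper name `masked_stable_softmax` and the all-zero-weights convention for fully masked rows are assumptions, not the only acceptable answer:

```python
import numpy as np

def masked_stable_softmax(S, mask=None):
    """Row-wise stable softmax over scores S of shape (Tq, Tk).

    Subtracting the row max before exponentiating prevents overflow for
    large-magnitude scores; the shift cancels out of the final ratio.
    Fully masked rows map to all-zero weights rather than NaN.
    """
    if mask is not None:
        S = np.where(mask, S, -np.inf)
    row_max = S.max(axis=-1, keepdims=True)
    # A fully masked row has max == -inf; shift it by 0 instead to
    # avoid computing -inf - (-inf) = NaN.
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    expS = np.exp(S - row_max)                # exp(-inf) == 0
    denom = expS.sum(axis=-1, keepdims=True)
    # Fully masked rows have denom == 0; define their weights as zeros.
    return np.where(denom > 0.0, expS / np.maximum(denom, 1e-30), 0.0)
```

On complexity: computing `S` is `O(Tq * Tk * d)`, the softmax is `O(Tq * Tk)`, and `A @ V` is `O(Tq * Tk * dv)`, so the total is `O(Tq * Tk * (d + dv))` time with `O(Tq * Tk)` extra memory for the attention weights. Shape mismatches (e.g. `Q.shape[1] != K.shape[1]` or `V.shape[0] != K.shape[0]`) are worth checking up front with explicit assertions.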
Quick Answer: This question evaluates whether a candidate can implement single-head scaled dot-product attention: the correct matrix operations over `Q`, `K`, and `V`, proper handling of the attention mask, and a numerically stable softmax. It tests linear algebra fluency and knowledge of model internals within the Coding & Algorithms / Machine Learning domain.