This question evaluates a candidate's ability to implement single-head scaled dot-product attention, including correct matrix operations for Q, K, V, masked attention handling, and numerically stable softmax, testing competencies in linear algebra and machine learning model internals within the Coding & Algorithms / Machine Learning domain.

In this interview you are asked to hand-write the forward pass of attention from the mathematical formula (no need to run code).
Implement single-head scaled dot-product attention.
Q: a 2D array of shape (Tq, d) (queries)
K: a 2D array of shape (Tk, d) (keys)
V: a 2D array of shape (Tk, dv) (values)
mask: a 2D array of shape (Tq, Tk), where mask[i][j] = 0/False means position j is not allowed to be attended to by query i.
O: a 2D array of shape (Tq, dv) defined by:

S = Q K^T / sqrt(d)

Wherever mask[i][j] = 0/False, treat S[i][j] as -inf before the softmax, so masked positions receive zero attention weight.

O = softmax(S) V, with the softmax taken over the key dimension (each row of S).

Your implementation should work for arbitrary Tq, Tk, d, and dv.
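
For reference, here is a minimal NumPy sketch of one possible solution. It follows the mask convention above (0/False = blocked) and assumes every query can attend to at least one key; the function name and the example shapes at the bottom are illustrative only, not part of the required interface.

```python
import numpy as np


def attention(Q, K, V, mask=None):
    """Single-head scaled dot-product attention.

    Q: (Tq, d) queries, K: (Tk, d) keys, V: (Tk, dv) values,
    mask: optional (Tq, Tk) array of 0/1 or bool (0/False = blocked).
    Returns O of shape (Tq, dv).
    """
    d = Q.shape[-1]

    # Scaled similarity scores: S[i][j] = <Q[i], K[j]> / sqrt(d), shape (Tq, Tk).
    S = (Q @ K.T) / np.sqrt(d)

    # Blocked positions get -inf so they receive zero weight after the softmax.
    # (Assumes every query can attend to at least one key.)
    if mask is not None:
        S = np.where(np.asarray(mask, dtype=bool), S, -np.inf)

    # Numerically stable softmax over the key axis: subtract the row-wise max
    # before exponentiating to avoid overflow.
    S = S - S.max(axis=-1, keepdims=True)
    W = np.exp(S)
    W = W / W.sum(axis=-1, keepdims=True)

    # Weighted sum of values, shape (Tq, dv).
    return W @ V


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((3, 4))
    K = rng.standard_normal((5, 4))
    V = rng.standard_normal((5, 2))
    mask = np.tril(np.ones((3, 5)))       # example mask: query i sees keys 0..i
    print(attention(Q, K, V, mask).shape)  # (3, 2)
```

The row-wise max subtraction leaves the softmax output unchanged (it cancels in the ratio) but keeps np.exp from overflowing, which is the usual way to satisfy the numerical-stability requirement in this kind of question.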