Implement scaled dot-product attention
Company: Meta
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
## Task
In this interview you are asked to **hand-write the forward pass of attention** from its mathematical definition (the code does not need to be run).
Implement **single-head scaled dot-product attention**.
### Inputs
- `Q`: a 2D array of shape `(Tq, d)` (queries)
- `K`: a 2D array of shape `(Tk, d)` (keys)
- `V`: a 2D array of shape `(Tk, dv)` (values)
- Optional `mask`: a 2D array of shape `(Tq, Tk)`, where `mask[i][j] = 0/False` means query `i` is **not allowed** to attend to position `j`.
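As a concrete example of this mask convention, a causal (decoder-style) mask for the case `Tq == Tk == T` could look like the sketch below; the variable names are illustrative:

```python
import numpy as np

T = 4
# Lower-triangular mask: query i may attend only to positions j <= i.
causal_mask = np.tril(np.ones((T, T), dtype=bool))
# causal_mask[0] == [ True, False, False, False]
# causal_mask[3] == [ True,  True,  True,  True]
```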
### Output
- `O`: a 2D array of shape `(Tq, dv)` defined by:
1) Compute scores:
\[
S = \frac{QK^\top}{\sqrt{d}} \quad \text{(shape } T_q \times T_k\text{)}
\]
2) Apply mask (if provided): for any masked-out entry `S[i][j]`, treat it as \(-\infty\) before softmax.
3) Apply row-wise softmax to get attention weights:
\[
A_{i,:} = \mathrm{softmax}(S_{i,:})
\]
4) Compute the output:
\[
O = AV
\]
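Putting the four steps together, here is a minimal NumPy sketch of one possible answer; the function name and the `-np.inf` masking convention are illustrative choices, not a prescribed API:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Single-head scaled dot-product attention.

    Q: (Tq, d), K: (Tk, d), V: (Tk, dv); optional boolean/0-1
    mask of shape (Tq, Tk). Returns O of shape (Tq, dv).
    """
    d = Q.shape[-1]

    # 1) Scores: scaling by sqrt(d) keeps dot-product magnitudes from
    #    growing with dimension, which would saturate the softmax.
    S = (Q @ K.T) / np.sqrt(d)

    # 2) Masked-out positions become -inf, so softmax assigns them
    #    exactly zero weight (exp(-inf) == 0).
    if mask is not None:
        S = np.where(mask, S, -np.inf)

    # 3) Numerically stable row-wise softmax: subtracting the row max
    #    prevents overflow in exp; the shift cancels in the ratio.
    #    Caveat: a fully masked row yields NaN here; see the edge-case
    #    discussion below for one way to handle it.
    row_max = S.max(axis=-1, keepdims=True)
    expS = np.exp(S - row_max)
    A = expS / expS.sum(axis=-1, keepdims=True)

    # 4) Weighted sum of values: (Tq, Tk) @ (Tk, dv) -> (Tq, dv).
    return A @ V
```

A common design variation is to use a large negative finite constant (e.g. `-1e9`) instead of `-np.inf` at masked positions; that sidesteps `inf - inf` NaNs at the cost of not zeroing the masked weights exactly.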
### Requirements / Discussion points
- Write clear pseudocode or Python-like code for the above.
- Explain how you would implement **numerically stable softmax** (see the sketch after this list).
- State the time complexity in terms of `Tq`, `Tk`, `d`, and `dv`.
- Mention at least two edge cases (e.g., a fully-masked row, large score magnitudes, shape mismatches); the sketch below treats the first two.
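For the stable-softmax and fully-masked-row points, one possible treatment is sketched below; the helper name `masked_stable_softmax` and the all-zero-weights convention for fully masked rows are assumptions, not the only acceptable answer:

```python
import numpy as np

def masked_stable_softmax(S, mask=None):
    """Row-wise stable softmax over scores S of shape (Tq, Tk).

    Subtracting the row max before exponentiating prevents overflow for
    large-magnitude scores; the shift cancels out of the final ratio.
    Fully masked rows map to all-zero weights rather than NaN.
    """
    if mask is not None:
        S = np.where(mask, S, -np.inf)
    row_max = S.max(axis=-1, keepdims=True)
    # A fully masked row has max == -inf; shift it by 0 instead to
    # avoid computing -inf - (-inf) = NaN.
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    expS = np.exp(S - row_max)                # exp(-inf) == 0
    denom = expS.sum(axis=-1, keepdims=True)
    # Fully masked rows have denom == 0; define their weights as zeros.
    return np.where(denom > 0.0, expS / np.maximum(denom, 1e-30), 0.0)
```

On complexity: computing `S` is `O(Tq * Tk * d)`, the softmax is `O(Tq * Tk)`, and `A @ V` is `O(Tq * Tk * dv)`, so the total is `O(Tq * Tk * (d + dv))` time with `O(Tq * Tk)` extra memory for the attention weights. Shape mismatches (e.g. `Q.shape[1] != K.shape[1]` or `V.shape[0] != K.shape[0]`) are worth checking up front with explicit assertions.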
Quick Answer: This question evaluates whether a candidate can implement single-head scaled dot-product attention: the correct matrix operations over `Q`, `K`, and `V`, proper handling of the attention mask, and a numerically stable softmax. It tests linear algebra fluency and knowledge of model internals within the Coding & Algorithms / Machine Learning domain.