PracHub
QuestionsCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates a candidate's ability to implement single-head scaled dot-product attention, including correct matrix operations for Q, K, V, masked attention handling, and numerically stable softmax, testing competencies in linear algebra and machine learning model internals within the Coding & Algorithms / Machine Learning domain.

  • medium
  • Meta
  • Coding & Algorithms
  • Machine Learning Engineer

Implement scaled dot-product attention

Company: Meta

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

## Task In this interview you are asked to **hand-write the forward pass of attention** from the mathematical formula (no need to run code). Implement **single-head scaled dot-product attention**. ### Inputs - `Q`: a 2D array of shape `(Tq, d)` (queries) - `K`: a 2D array of shape `(Tk, d)` (keys) - `V`: a 2D array of shape `(Tk, dv)` (values) - Optional `mask`: a 2D array of shape `(Tq, Tk)` where `mask[i][j] = 0/False` means position `j` is **not allowed** to be attended to by query `i`. ### Output - `O`: a 2D array of shape `(Tq, dv)` defined by: 1) Compute scores: \[ S = \frac{QK^\top}{\sqrt{d}} \quad \text{(shape } Tq \times Tk\text{)} \] 2) Apply mask (if provided): for any masked-out entry `S[i][j]`, treat it as \(-\infty\) before softmax. 3) Apply row-wise softmax to get attention weights: \[ A_{i,:} = \mathrm{softmax}(S_{i,:}) \] 4) Compute the output: \[ O = AV \] ### Requirements / Discussion points - Write clear pseudocode or Python-like code for the above. - Explain how you would implement **numerically stable softmax**. - State the time complexity in terms of `Tq`, `Tk`, `d`, and `dv`. - Mention at least two edge cases (e.g., fully-masked row, large magnitudes, shape mismatches).

Quick Answer: This question evaluates a candidate's ability to implement single-head scaled dot-product attention, including correct matrix operations for Q, K, V, masked attention handling, and numerically stable softmax, testing competencies in linear algebra and machine learning model internals within the Coding & Algorithms / Machine Learning domain.

Implement the forward pass of single-head **scaled dot-product attention**. ### Inputs - `Q`: a 2D list of shape `(Tq, d)` (queries) - `K`: a 2D list of shape `(Tk, d)` (keys) - `V`: a 2D list of shape `(Tk, dv)` (values) - `mask` (optional): a 2D list of shape `(Tq, Tk)` where a falsy entry `mask[i][j]` (0/False) means key position `j` is **not allowed** to be attended to by query `i`. If `mask` is `None`, all positions are allowed. ### Output Return `O`, a 2D list of shape `(Tq, dv)`: 1. **Scores:** `S = (Q · Kᵀ) / sqrt(d)`, shape `(Tq, Tk)`. 2. **Mask:** for any masked-out entry, set `S[i][j] = -inf` before softmax. 3. **Softmax (row-wise):** `A[i,:] = softmax(S[i,:])` using the numerically stable form (subtract the row max before exponentiating). 4. **Output:** `O = A · V`. ### Conventions for this challenge - Use a **numerically stable softmax**: subtract the maximum *finite* score in the row before calling `exp`. Masked entries (`-inf`) contribute weight `0`. - For a **fully-masked row** (every key disallowed), output a **uniform** distribution over all `Tk` keys (weight `1/Tk` each) rather than producing `NaN`. - If `Q` is empty (`Tq == 0`), return `[]`. - Round every output value to **6 decimal places** so results are deterministic.

Constraints

  • 1 <= d, dv (each row of Q/K has length d; each row of V has length dv)
  • 0 <= Tq, Tk (Tq == 0 returns an empty list)
  • K and V share the same first dimension Tk
  • mask, when provided, has shape (Tq, Tk); a falsy entry forbids attending to that key
  • All numeric inputs are real-valued floats; output values are rounded to 6 decimal places

Examples

Input: ([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]], None)

Expected Output: [[1.660477, 2.660477]]

Explanation: Single query [1,0], d=2 so scale=sqrt(2). Raw scores = [1,0]; scaled = [0.7071, 0]. Softmax ~ [0.6698, 0.3302]; output = 0.6698*[1,2] + 0.3302*[3,4] = [1.660477, 2.660477].

Input: ([[0.0, 0.0]], [[5.0, -3.0], [2.0, 8.0]], [[10.0, 0.0], [0.0, 10.0]], None)

Expected Output: [[5.0, 5.0]]

Explanation: Query is all zeros, so every score is 0 regardless of keys; softmax gives uniform weights [0.5, 0.5]; output = 0.5*[10,0] + 0.5*[0,10] = [5,5].

Input: ([[1.0, 1.0], [2.0, -1.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]], [[1, 0], [1, 1]])

Expected Output: [[1.0], [1.107042]]

Explanation: Row 0 masks out key 1, so only key 0 is attended -> weight 1.0 -> output V[0]=[1.0]. Row 1 attends both keys with stable softmax over the scaled scores, giving [1.107042].

Input: ([[2.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]], [[4.0, 8.0], [4.0, 8.0]], None)

Expected Output: [[4.0, 8.0]]

Explanation: Both keys are identical so scores tie; softmax is uniform [0.5,0.5], and both value rows are equal, so the output equals either value row: [4,8].

Input: ([], [[1.0]], [[1.0]], None)

Expected Output: []

Explanation: Edge case: empty query matrix (Tq=0) returns an empty output list.

Input: ([[3.0]], [[1.0], [2.0]], [[7.0, 1.0], [9.0, 2.0]], [[0, 0]])

Expected Output: [[8.0, 1.5]]

Explanation: Fully-masked row (both keys disallowed). Instead of NaN, fall back to a uniform distribution [0.5,0.5] over all keys: output = 0.5*[7,1] + 0.5*[9,2] = [8,1.5].

Hints

  1. Process one query row at a time: compute its Tk scores, softmax them, then combine the value rows.
  2. Numerically stable softmax: subtract the row's maximum *finite* score before exponentiating; this prevents overflow for large magnitudes and leaves the result unchanged.
  3. Treat a masked entry as score -inf, which exponentiates to weight 0 — but guard the fully-masked row separately (all -inf) to avoid dividing by zero; here we fall back to a uniform distribution.
  4. Don't forget the 1/sqrt(d) scaling on the raw dot products before softmax.
Last updated: Jun 26, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Find Shortest Unique Prefixes - Meta (medium)
  • Compute Exclusive Execution Times - Meta (medium)
  • Solve Tree Columns And Maze Variants - Meta (medium)
  • Solve Tree Diameter and Palindromic Counts - Meta (medium)
  • Simulate Monster Team Battles - Meta (hard)