
Implement multi-head self-attention correctly

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's practical implementation skills and conceptual understanding of multi-head self-attention, including query/key/value projections, head-wise tensor reshaping, masking behavior, and considerations for numerical stability and computational complexity.

  • hard
  • Apple
  • Machine Learning
  • Software Engineer


Company: Apple

Role: Software Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

Implement multi-head self-attention from scratch in PyTorch or NumPy. Given input X with shape (batch_size, seq_len, d_model), implement linear projections for Q, K, and V; split into h heads (d_k = d_model / h); compute scaled dot-product attention with optional padding and causal masks; then concatenate heads and apply an output projection. Ensure all tensor shapes and transpositions are correct. Provide the forward pass, explain the shape of each intermediate tensor, analyze time and memory complexity, and discuss numerical stability considerations (e.g., softmax scaling and mixed precision).

Quick Answer: Project X with learned linear layers into Q, K, and V, reshape each from (batch_size, seq_len, d_model) to (batch_size, h, seq_len, d_k), compute softmax(Q Kᵀ / √d_k) · V per head with padding or causal masks applied as -inf to the scores before the softmax, then concatenate the heads back to (batch_size, seq_len, d_model) and apply a final output projection. Time complexity is O(seq_len² · d_model) and the attention weights cost O(h · seq_len²) memory per example; the 1/√d_k scaling and computing the softmax in float32 under mixed precision keep the computation numerically stable.
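In symbols, the standard per-head computation (using the same names as the problem statement; W_i^Q, W_i^K, W_i^V denote the per-head slices of the learned projections and W^O the output projection) is:

    Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) · V
    head_i = Attention(X W_i^Q, X W_i^K, X W_i^V)
    MultiHead(X) = Concat(head_1, ..., head_h) · W^O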

Related Interview Questions

  • Implement Masked Multi-Head Self-Attention - Apple (easy)
  • Compare DCN v1 vs v2 and A/B test - Apple (medium)
  • Explain dataset size, generalization, and U-Net skips - Apple (medium)
  • Analyze vision model failures - Apple (medium)
  • Compare audio preprocessing and training - Apple (medium)
Apple • Software Engineer • Technical Screen • Machine Learning • Sep 6, 2025

Implement Multi-Head Self-Attention (from scratch)

Context

You are given an input tensor X with shape (batch_size, seq_len, d_model). Implement a multi-head self-attention layer (forward pass) using PyTorch or NumPy that:

  • Projects inputs into queries (Q), keys (K), and values (V).
  • Splits into h heads with per-head dimension d_k = d_model / h.
  • Computes scaled dot-product attention with optional padding and causal masks.
  • Concatenates heads and applies an output projection.

Assume d_model is divisible by h.
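As a concrete illustration of the head split and merge described above, here is a small sketch (assuming PyTorch; the sizes batch_size=2, seq_len=5, d_model=8, h=2 are arbitrary examples):

    import torch

    # Example sizes chosen only for illustration.
    batch_size, seq_len, d_model, h = 2, 5, 8, 2
    d_k = d_model // h                              # 4

    x = torch.randn(batch_size, seq_len, d_model)   # (2, 5, 8)

    # Split the model dimension into h heads and move the head axis
    # next to the batch axis so each head attends independently.
    x_heads = x.view(batch_size, seq_len, h, d_k).transpose(1, 2)   # (2, 2, 5, 4)

    # Undo the split: transpose back and merge the head and d_k axes.
    x_merged = x_heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
    assert torch.equal(x, x_merged)                 # round trip is lossless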

Requirements

  1. Implement the forward pass with correct tensor shapes and transpositions.
  2. Support optional masks:
    • Padding mask (e.g., shape (batch_size, seq_len) or broadcastable variants).
    • Causal mask (prevent attending to future positions).
  3. Explain the shape of each intermediate tensor.
  4. Analyze time and memory complexity.
  5. Discuss numerical stability (e.g., scaling, masking, softmax stability, mixed precision).
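A minimal PyTorch sketch of one possible forward pass (class and argument names and the mask conventions are illustrative assumptions, not a reference implementation):

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHeadSelfAttention(nn.Module):
        """Minimal multi-head self-attention sketch; names are illustrative."""

        def __init__(self, d_model: int, h: int):
            super().__init__()
            assert d_model % h == 0, "d_model must be divisible by h"
            self.h = h
            self.d_k = d_model // h
            # Separate learned projections for Q, K, V, plus the output projection.
            self.w_q = nn.Linear(d_model, d_model)
            self.w_k = nn.Linear(d_model, d_model)
            self.w_v = nn.Linear(d_model, d_model)
            self.w_o = nn.Linear(d_model, d_model)

        def forward(self, x, padding_mask=None, causal=False):
            # x: (batch_size, seq_len, d_model)
            b, s, _ = x.shape

            def split_heads(t):
                # (b, s, d_model) -> (b, h, s, d_k)
                return t.view(b, s, self.h, self.d_k).transpose(1, 2)

            q = split_heads(self.w_q(x))          # (b, h, s, d_k)
            k = split_heads(self.w_k(x))          # (b, h, s, d_k)
            v = split_heads(self.w_v(x))          # (b, h, s, d_k)

            # Scaled dot-product scores: (b, h, s, s).
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)

            if padding_mask is not None:
                # Assumed convention: padding_mask is (b, s), True where the
                # key position is padding and must not be attended to.
                scores = scores.masked_fill(padding_mask[:, None, None, :], float("-inf"))
            if causal:
                # True above the diagonal marks future positions to hide.
                future = torch.triu(
                    torch.ones(s, s, dtype=torch.bool, device=x.device), diagonal=1
                )
                scores = scores.masked_fill(future, float("-inf"))

            attn = F.softmax(scores, dim=-1)      # (b, h, s, s)
            out = attn @ v                        # (b, h, s, d_k)

            # Merge heads: (b, h, s, d_k) -> (b, s, d_model), then project.
            out = out.transpose(1, 2).contiguous().view(b, s, self.h * self.d_k)
            return self.w_o(out)                  # (b, s, d_model)

A quick usage check under the same assumptions:

    mha = MultiHeadSelfAttention(d_model=512, h=8)
    x = torch.randn(4, 10, 512)
    pad = torch.zeros(4, 10, dtype=torch.bool)   # no padded positions in this toy batch
    y = mha(x, padding_mask=pad, causal=True)
    print(y.shape)                               # torch.Size([4, 10, 512])

The (batch_size, h, seq_len, seq_len) score tensor dominates cost: roughly O(seq_len² · d_model) time and O(h · seq_len²) activation memory per example. Under mixed precision the softmax is typically computed in float32, and the 1/√d_k scaling keeps the logits in a range where the softmax neither saturates nor underflows.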


