
Implement and analyze custom attention

Last updated: May 2, 2026

Quick Overview

This question evaluates your ability to implement and analyze scaled dot-product attention. It tests efficient tensorized PyTorch coding, numerical stability and mixed-precision handling, masking semantics (causal and padding), multi-head shape correctness, and unit-test validation covering gradient and edge-case behavior.


Company: Anthropic

Role: Software Engineer

Category: Machine Learning

Difficulty: Hard

Interview Round: Onsite

Implement a custom attention module in PyTorch from scratch. Given Q, K, V tensors, compute attention scores, apply a scaling factor, masking, normalization, and produce the output. Ensure the code runs end-to-end with tests on shape correctness, gradient flow, and numerical stability. Explain why scaling (e.g., by 1/sqrt(d_k)) is used, when and why to apply normalization (e.g., softmax vs. layer normalization), and analyze the time and memory complexity. Propose optimizations for long sequences.
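
For reference, the standard formulation the prompt describes is:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

The scaling factor comes from a variance argument: if the entries of a query q and a key k are i.i.d. with mean 0 and variance 1, the dot product q·k = Σ_{i=1}^{d_k} q_i k_i has mean 0 and variance d_k. Dividing by sqrt(d_k) restores unit variance, keeping the softmax inputs in a range where it does not saturate and gradients do not vanish.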


Implement Scaled Dot-Product Attention in PyTorch (from scratch)

Context

You will implement a numerically stable, vectorized scaled dot-product attention module in PyTorch and validate it with unit tests. Assume Q may have a different sequence length from K and V (Lq vs. Lk), and that inputs may optionally have multiple heads.

Requirements

  1. Implementation
    • Create a PyTorch module that computes scaled dot-product attention (a reference sketch follows this list):
      • Inputs: Q ∈ R^{B×H×Lq×d_k}, K ∈ R^{B×H×Lk×d_k}, V ∈ R^{B×H×Lk×d_v} (allow H=1 when working without heads; also accept 3D inputs by treating them as H=1).
      • Compute scores S = Q K^T, apply scaling by 1/sqrt(d_k) (unless a custom scale is provided).
      • Support masking:
        • Causal mask (upper-triangular).
        • Padding/attention masks (broadcastable to B×H×Lq×Lk). Boolean masks should treat True as "masked out"; additive masks should be added to scores (e.g., 0 for keep, -inf for disallow).
      • Apply numerically stable softmax to get attention weights; apply optional dropout on weights.
      • Return both attention output (B×H×Lq×d_v) and attention weights (B×H×Lq×Lk).
      • Ensure fp16/bf16 inputs are handled by upcasting to float32 for the softmax and downcasting afterward.
      • Ensure rows that are fully masked produce zero attention weights and zero outputs (no NaNs).
  2. Tests
    • Shape correctness: multiple heads, different Lq and Lk, different d_k and d_v.
    • Gradient flow: backprop from a scalar loss; gradients are finite and nonzero for Q, K, V.
    • Numerical stability: very large-magnitude Q/K should not produce NaNs/Inf; rows fully masked should produce zero weights/output.
    • Masking correctness: padding mask and causal mask both behave as expected.
  3. Explanations
    • Explain why scaling by 1/sqrt(d_k) is used (variance control, softmax non-saturation) with a short derivation/intuition.
    • Clarify when and why you apply normalization:
      • Softmax for attention weights (across keys for each query).
      • LayerNorm (or RMSNorm) around the attention block for training stability, not as a replacement for softmax.
    • Analyze time and memory complexity.
    • Propose optimizations for long sequences.
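
Below is a minimal sketch of a module satisfying requirement 1. The class name, argument names, and defaults are illustrative choices, not prescribed by the prompt.

import math
import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    def __init__(self, dropout=0.0, scale=None):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.scale = scale  # None -> default to 1/sqrt(d_k)

    def forward(self, q, k, v, mask=None, is_causal=False):
        # Accept 3D (B, L, d) inputs by treating them as single-head.
        squeezed = q.dim() == 3
        if squeezed:
            q, k, v = q.unsqueeze(1), k.unsqueeze(1), v.unsqueeze(1)

        scale = self.scale if self.scale is not None else 1.0 / math.sqrt(q.size(-1))
        out_dtype = q.dtype

        # Upcast to float32 so the softmax is stable under fp16/bf16.
        scores = (q.float() @ k.float().transpose(-2, -1)) * scale  # B×H×Lq×Lk

        if is_causal:
            Lq, Lk = scores.shape[-2:]
            future = torch.ones(Lq, Lk, dtype=torch.bool, device=scores.device).triu(1)
            scores = scores.masked_fill(future, float("-inf"))
        if mask is not None:
            if mask.dtype == torch.bool:
                scores = scores.masked_fill(mask, float("-inf"))  # True = masked out
            else:
                scores = scores + mask.float()  # additive: 0 keep, -inf disallow

        weights = torch.softmax(scores, dim=-1)
        # A fully masked row softmaxes all -inf to NaN; convert it to zeros.
        weights = torch.nan_to_num(weights, nan=0.0)
        weights = self.dropout(weights)

        out = (weights @ v.float()).to(out_dtype)
        weights = weights.to(out_dtype)
        if squeezed:
            out, weights = out.squeeze(1), weights.squeeze(1)
        return out, weights

A useful sanity check is to compare the output (with dropout disabled) against torch.nn.functional.scaled_dot_product_attention on unmasked inputs.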

Solution

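One possible test file for requirement 2, written pytest-style against the sketch above (the import path is hypothetical):

import torch
from attention import ScaledDotProductAttention  # hypothetical module path

def test_shape_correctness():
    attn = ScaledDotProductAttention()
    q = torch.randn(2, 4, 5, 8)    # B=2, H=4, Lq=5, d_k=8
    k = torch.randn(2, 4, 7, 8)    # Lk=7
    v = torch.randn(2, 4, 7, 16)   # d_v=16
    out, w = attn(q, k, v)
    assert out.shape == (2, 4, 5, 16) and w.shape == (2, 4, 5, 7)

def test_gradient_flow():
    attn = ScaledDotProductAttention()
    q, k, v = (torch.randn(1, 2, 3, 4, requires_grad=True) for _ in range(3))
    out, _ = attn(q, k, v)
    out.sum().backward()
    for t in (q, k, v):
        assert t.grad is not None and torch.isfinite(t.grad).all()
        assert t.grad.abs().sum() > 0

def test_numerical_stability_and_full_masking():
    attn = ScaledDotProductAttention()
    q = 1e4 * torch.randn(1, 1, 3, 4)  # large-magnitude scores
    k, v = torch.randn(1, 1, 3, 4), torch.randn(1, 1, 3, 4)
    mask = torch.zeros(1, 1, 3, 3, dtype=torch.bool)
    mask[..., 0, :] = True  # first query attends to no keys at all
    out, w = attn(q, k, v, mask=mask)
    assert torch.isfinite(out).all() and torch.isfinite(w).all()
    assert w[..., 0, :].eq(0).all() and out[..., 0, :].eq(0).all()

def test_causal_masking():
    attn = ScaledDotProductAttention()
    q = k = v = torch.randn(1, 1, 4, 4)
    _, w = attn(q, k, v, is_causal=True)
    assert torch.triu(w[0, 0], diagonal=1).eq(0).all()  # no attention to the future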
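
On complexity: materializing the score matrix makes attention O(B·H·Lq·Lk·d) time and O(B·H·Lq·Lk) memory, and for long sequences the memory term is the bottleneck. One simple mitigation is to process queries in chunks so the full Lq×Lk matrix never exists at once; fused kernels such as FlashAttention (exposed in PyTorch via torch.nn.functional.scaled_dot_product_attention) go further by also tiling over keys and computing the softmax online. A minimal query-chunking sketch (function name and chunk size are illustrative):

import math
import torch

def chunked_attention(q, k, v, chunk=1024):
    # Peak memory per head is O(chunk × Lk) instead of O(Lq × Lk).
    scale = 1.0 / math.sqrt(q.size(-1))
    outs = []
    for i in range(0, q.size(-2), chunk):
        s = (q[..., i:i + chunk, :] @ k.transpose(-2, -1)) * scale
        outs.append(torch.softmax(s, dim=-1) @ v)
    return torch.cat(outs, dim=-2)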

