PracHub

Implement and analyze custom attention

Last updated: May 2, 2026

Quick Overview

This question evaluates implementation and analysis skills for scaled dot-product attention, testing competencies in efficient tensorized PyTorch coding, numerical stability and mixed-precision handling, masking semantics (causal and padding), multi-head shape correctness, and unit-test validation including gradient and edge-case behavior.

Company: Anthropic

Role: Software Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Onsite

Implement a custom attention module in PyTorch from scratch. Given Q, K, V tensors, compute attention scores, apply a scaling factor, masking, normalization, and produce the output. Ensure the code runs end-to-end with tests on shape correctness, gradient flow, and numerical stability. Explain why scaling (e.g., by 1/sqrt(d_k)) is used, when and why to apply normalization (e.g., softmax vs. layer normalization), and analyze the time and memory complexity. Propose optimizations for long sequences.

Sep 6, 2025, 12:00 AM

Implement Scaled Dot-Product Attention in PyTorch (from scratch)

Context

You will implement a numerically stable, vectorized scaled dot-product attention module in PyTorch and validate it with unit tests. Assume Q, K, V can have different sequence lengths (Lq vs Lk) and optionally multiple heads.

Requirements

  1. Implementation
    • Create a PyTorch module that computes scaled dot-product attention:
      • Inputs: Q ∈ R^{B×H×Lq×d_k}, K ∈ R^{B×H×Lk×d_k}, V ∈ R^{B×H×Lk×d_v} (allow H=1 when working without heads; also accept 3D by treating it as H=1).
      • Compute scores S = Q K^T, apply scaling by 1/sqrt(d_k) (unless a custom scale is provided).
      • Support masking:
        • Causal mask (strictly upper-triangular: query i must not attend to keys j > i).
        • Padding/attention masks (broadcastable to B×H×Lq×Lk). Boolean masks should treat True as "masked out"; additive masks should be added to scores (e.g., 0 for keep, -inf for disallow).
      • Apply numerically stable softmax to get attention weights; apply optional dropout on weights.
      • Return both attention output (B×H×Lq×d_v) and attention weights (B×H×Lq×Lk).
      • Ensure fp16/bf16 inputs are handled by upcasting to float32 for the softmax and downcasting afterward.
      • Ensure rows that are fully masked produce zero attention weights and zero outputs (no NaNs).
  2. Tests
    • Shape correctness: multiple heads, different Lq and Lk, different d_k and d_v.
    • Gradient flow: backprop from a scalar loss; gradients are finite and nonzero for Q, K, V.
    • Numerical stability: very large-magnitude Q/K should not produce NaNs/Inf; rows fully masked should produce zero weights/output.
    • Masking correctness: padding mask and causal mask both behave as expected.
  3. Explanations
    • Explain why scaling by 1/sqrt(d_k) is used (variance control, softmax non-saturation) with a short derivation/intuition.
    • Clarify when and why you apply normalization:
      • Softmax for attention weights (across keys for each query).
      • LayerNorm (or RMSNorm) around the attention block for training stability, not as a replacement for softmax.
    • Analyze time and memory complexity.
    • Propose optimizations for long sequences.
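
The implementation requirements above can be sketched as a single module. This is a minimal sketch under stated assumptions, not a reference solution: the class name `ScaledDotProductAttention` and the `mask`, `causal`, and `scale` argument names are illustrative, the causal mask assumes queries and keys are aligned at index 0, and the returned weights are the post-dropout weights.

```python
import math
import torch
import torch.nn as nn


class ScaledDotProductAttention(nn.Module):
    """Illustrative sketch; names and signature are assumptions, not a fixed API."""

    def __init__(self, dropout: float = 0.0):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None, causal=False, scale=None):
        squeeze = q.dim() == 3  # treat (B, L, d) inputs as a single head
        if squeeze:
            q, k, v = q.unsqueeze(1), k.unsqueeze(1), v.unsqueeze(1)
            if mask is not None and mask.dim() == 3:
                mask = mask.unsqueeze(1)
        scale = 1.0 / math.sqrt(q.size(-1)) if scale is None else scale

        # Upcast so scores and softmax are computed in float32 for fp16/bf16 inputs.
        scores = torch.matmul(q.float(), k.float().transpose(-2, -1)) * scale

        lq, lk = scores.shape[-2:]
        blocked = torch.zeros(lq, lk, dtype=torch.bool, device=q.device)
        if causal:
            # Assumption: positions are aligned at index 0, so query i sees keys j <= i.
            i = torch.arange(lq, device=q.device).unsqueeze(-1)
            j = torch.arange(lk, device=q.device)
            blocked = j > i
        if mask is not None:
            if mask.dtype == torch.bool:
                blocked = blocked | mask        # True means "masked out"
            else:
                scores = scores + mask.float()  # additive mask: 0 keep, -inf disallow
        scores = scores.masked_fill(blocked, float("-inf"))

        # Stable softmax: subtract the row max (detached; softmax is shift-invariant).
        # Fully masked rows have a max of -inf, which is replaced by 0 to avoid NaNs.
        row_max = scores.amax(dim=-1, keepdim=True).detach()
        row_max = row_max.masked_fill(row_max == float("-inf"), 0.0)
        exp = torch.exp(scores - row_max)
        denom = exp.sum(dim=-1, keepdim=True)
        # A zero denominator only occurs for fully masked rows; dividing by 1 yields zeros.
        denom = torch.where(denom == 0, torch.ones_like(denom), denom)
        weights = self.dropout(exp / denom)

        out = torch.matmul(weights, v.float())
        out, weights = out.to(q.dtype), weights.to(q.dtype)  # downcast to input dtype
        if squeeze:
            out, weights = out.squeeze(1), weights.squeeze(1)
        return out, weights
```

Computing the softmax by hand, rather than calling `torch.softmax` and patching NaNs afterward, is one way to keep both the forward values and the gradients finite for fully masked rows.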
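
The test requirements can be sketched as reusable checks that accept any attention callable. Everything named here is hypothetical: `reference_attention` is a plain softmax stand-in for the module under test, and the `check_*` helpers assume a callable of the form `attn_fn(q, k, v, mask=None)` that returns `(output, weights)` and treats `True` in a boolean mask as "masked out".

```python
import math
import torch


def reference_attention(q, k, v, mask=None):
    """Plain softmax attention, used here only as a stand-in system under test."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # True = masked out
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


def check_shapes(attn_fn):
    # Multiple heads, Lq != Lk, d_k != d_v.
    q, k, v = torch.randn(2, 4, 5, 16), torch.randn(2, 4, 7, 16), torch.randn(2, 4, 7, 8)
    out, w = attn_fn(q, k, v)
    assert out.shape == (2, 4, 5, 8) and w.shape == (2, 4, 5, 7)


def check_gradients(attn_fn):
    # Backprop from a scalar loss; gradients must be finite and nonzero.
    q, k, v = (torch.randn(1, 2, 3, 4, requires_grad=True) for _ in range(3))
    out, _ = attn_fn(q, k, v)
    out.pow(2).sum().backward()
    for t in (q, k, v):
        assert torch.isfinite(t.grad).all() and t.grad.abs().sum() > 0


def check_large_magnitude(attn_fn):
    # Very large Q/K should not produce NaN or Inf.
    q, k = torch.randn(1, 1, 3, 8) * 1e4, torch.randn(1, 1, 5, 8) * 1e4
    out, w = attn_fn(q, k, torch.randn(1, 1, 5, 8))
    assert torch.isfinite(out).all() and torch.isfinite(w).all()


def check_padding_mask(attn_fn):
    q, k, v = torch.randn(1, 1, 3, 8), torch.randn(1, 1, 5, 8), torch.randn(1, 1, 5, 8)
    mask = torch.zeros(1, 1, 1, 5, dtype=torch.bool)
    mask[..., 3:] = True  # the last two keys are padding
    _, w = attn_fn(q, k, v, mask=mask)
    assert (w[..., 3:] == 0).all()
    assert torch.allclose(w.sum(-1), torch.ones(1, 1, 3))


for check in (check_shapes, check_gradients, check_large_magnitude, check_padding_mask):
    check(reference_attention)
```

A fully-masked-row check is left out of this sketch because the plain `torch.softmax` stand-in returns NaN for such rows; that case needs an implementation that handles it explicitly.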
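
The scaling argument can be written out as a short derivation. The assumption, stated here and not in the prompt, is that the components of a query q and a key k are independent with zero mean and unit variance.

```latex
\begin{aligned}
s &= q^\top k = \sum_{i=1}^{d_k} q_i k_i, \\
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1, \\
\operatorname{Var}(s) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k, \\
\operatorname{Var}\!\left(\frac{s}{\sqrt{d_k}}\right) &= 1.
\end{aligned}
```

Under these assumptions, unscaled scores have standard deviation sqrt(d_k), so for large d_k the softmax tends to saturate toward a one-hot distribution and its gradients shrink. Dividing by sqrt(d_k) keeps the scores at roughly unit variance regardless of head dimension.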
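
For the complexity and long-sequence points: the direct computation takes O(B·H·Lq·Lk·(d_k + d_v)) time and materializes a B×H×Lq×Lk score matrix. One common direction is to process keys and values in blocks with a running ("online") softmax, which is the idea behind fused kernels such as FlashAttention. The sketch below is illustrative only: `chunked_attention` and `block_size` are hypothetical names, and masking and dropout are omitted for brevity.

```python
import math
import torch


def chunked_attention(q, k, v, block_size=128):
    """Attention over key/value blocks with a running (online) softmax.

    Illustrative sketch: no masking or dropout, float32 accumulation.
    Peak score memory is O(Lq * block_size) instead of O(Lq * Lk).
    """
    scale = 1.0 / math.sqrt(q.size(-1))
    qf = q.float()
    lead = tuple(q.shape[:-1])
    # Running row max, softmax denominator, and unnormalized output.
    m = torch.full(lead + (1,), float("-inf"), device=q.device)
    l = torch.zeros(lead + (1,), device=q.device)
    acc = torch.zeros(lead + (v.size(-1),), device=q.device)

    for start in range(0, k.size(-2), block_size):
        kb = k[..., start:start + block_size, :].float()
        vb = v[..., start:start + block_size, :].float()
        s = torch.matmul(qf, kb.transpose(-2, -1)) * scale
        m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
        correction = torch.exp(m - m_new)  # rescales the sums from earlier blocks
        p = torch.exp(s - m_new)
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + torch.matmul(p, vb)
        m = m_new

    return (acc / l).to(q.dtype)
```

This version does not return the attention weights, since storing them would bring back the O(Lq·Lk) memory cost. In practice, PyTorch 2.0 and later provide `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to fused memory-efficient kernels when the inputs allow it. Other options worth discussing include sliding-window or sparse attention patterns and low-rank or kernel-based approximations, each of which trades exactness for lower cost.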