PracHub

Implement correct attention masking

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's competency in attention mechanisms for autoregressive Transformers, specifically the implementation of causal and padding masks, their combination in multi-head attention, and the ability to diagnose masking-related training anomalies.

Company: Applied Intuition

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

Implement correct attention masking for an autoregressive Transformer with variable-length sequences and padding. Produce code that constructs (a) a causal mask preventing attention to future tokens and (b) a padding mask that excludes padded positions, and show how to combine and apply them in multi-head attention. Explain common bugs, how they manifest in training loss or accuracy, and how to test for them.

Sep 6, 2025, 12:00 AM

Autoregressive Transformer: Correct Attention Masking with Padding

Context: You are implementing decoder self-attention for an autoregressive Transformer where input sequences in a batch are right-padded to a common length. You must prevent attention to future tokens and exclude padded positions.

Task:

  1. Implement a causal mask that prevents attending to future tokens.
  2. Implement a padding mask that prevents attending to padded positions.
  3. Show how to combine and apply these masks in multi-head attention.
  4. Explain common bugs, how they appear in loss/accuracy, and how to test for them.
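
Tasks 1 to 3 can be sketched as follows, assuming the conventions listed under Assumptions (right padding, an attention_mask with 1 for real tokens). The function names build_attention_mask and masked_multi_head_attention, the tensor sizes, and the choice of a boolean "allowed" mask are illustrative assumptions rather than part of the question; this is one possible approach, not a reference solution.

```python
import math
import torch


def build_attention_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Combine causal and padding masks into a boolean "allowed" mask.

    attention_mask: (B, T), 1 for real tokens and 0 for padding.
    Returns: (B, 1, T, T) bool, True where query i may attend to key j.
    """
    B, T = attention_mask.shape
    device = attention_mask.device
    # (a) Causal mask: query i may only see keys j <= i (lower triangle).
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=device))
    # (b) Padding mask over KEYS: shape (B, 1, 1, T) broadcasts over queries.
    key_is_real = attention_mask.bool()[:, None, None, :]
    # Combine with logical AND; the singleton axis 1 broadcasts over heads.
    return causal[None, None, :, :] & key_is_real


def masked_multi_head_attention(q, k, v, allowed):
    """q, k, v: (B, H, T, d_head); allowed: (B, 1, T, T) bool."""
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)  # (B, H, T, T)
    # Mask BEFORE the softmax. The most negative finite value is used instead
    # of -inf so that a fully masked row would not produce NaN.
    scores = scores.masked_fill(~allowed, torch.finfo(scores.dtype).min)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


# Tiny illustrative batch: the second sequence has two padded positions.
torch.manual_seed(0)
B, H, T, d_head = 2, 4, 5, 8
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])
allowed = build_attention_mask(attention_mask)
q, k, v = (torch.randn(B, H, T, d_head) for _ in range(3))
out, weights = masked_multi_head_attention(q, k, v, allowed)
```

Outputs at padded query positions are still computed here, so they would typically be excluded from the loss as well (for example by setting their labels to the ignore_index of the cross-entropy loss). Bugs worth checking for in a sketch like this include applying the mask after the softmax, broadcasting the padding mask over the query axis instead of the key axis, and inverting the mask convention: PyTorch's nn.MultiheadAttention treats True in key_padding_mask as "ignore", whereas this attention_mask uses 1 for "keep".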

Assumptions:

  • Batch-first tensors of shape (B, T, D): batch size B, padded sequence length T, model dimension D.
  • Right padding: an attention_mask of shape (B, T) uses 1 for real tokens and 0 for padding (equivalently, it can be derived from input_ids and pad_token_id).
  • PyTorch is available.
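
Under these assumptions, the mask can be derived from input_ids and the masking behaviour checked with perturbation tests, as in the sketch below. The pad_token_id value, the helper names attend and perturb, and the tensor sizes are hypothetical choices for illustration. The idea is that outputs at positions up to t must not change when keys and values after t are replaced with noise; the run with causal=False is a deliberate negative control showing that the test does detect a leak.

```python
import math
import torch

torch.manual_seed(0)
pad_token_id = 0  # hypothetical value; use your tokenizer's actual pad id
input_ids = torch.tensor([[5, 7, 9, 3, 6],
                          [4, 8, 2, 0, 0]])    # (B=2, T=5), right-padded
attention_mask = input_ids != pad_token_id      # (B, T) bool, True = real token
B, T = input_ids.shape
H, d_head = 2, 4


def attend(q, k, v, attention_mask, causal=True):
    """Masked attention on (B, H, T, d_head); causal=False simulates a leak."""
    T = q.size(-2)
    allowed = attention_mask[:, None, None, :].expand(-1, 1, T, T)
    if causal:
        allowed = allowed & torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores.masked_fill(~allowed, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


def perturb(x, positions):
    """Copy of x with fresh noise at the given (B, T) boolean positions."""
    return torch.where(positions[:, None, :, None], torch.randn_like(x), x)


q, k, v = (torch.randn(B, H, T, d_head) for _ in range(3))
t = 1
future = (torch.arange(T) > t).expand(B, T)  # positions strictly after t


def prefix_unchanged(causal):
    a = attend(q, k, v, attention_mask, causal)
    b = attend(q, perturb(k, future), perturb(v, future), attention_mask, causal)
    return torch.allclose(a[:, :, :t + 1], b[:, :, :t + 1])


# Test 1 (causality): perturbing the future must not change the prefix.
causal_ok = prefix_unchanged(causal=True)
# Negative control: without the causal mask the same test reports a leak.
leak_detected = not prefix_unchanged(causal=False)

# Test 2 (padding): perturbing padded keys/values must not change outputs at
# real positions. With right padding AND a causal mask this already follows
# from causality; it becomes discriminating for left padding or non-causal use.
pad = ~attention_mask
a = attend(q, k, v, attention_mask)
b = attend(q, perturb(k, pad), perturb(v, pad), attention_mask)
real = attention_mask[:, None, :, None].expand_as(a)
padding_ok = torch.allclose(a[real], b[real])
```

In training, a causal leak typically shows up as a loss that falls implausibly low while generation quality stays poor, because the model has learned to copy the next token from the input. One caveat on deriving the mask from input_ids: if the pad token shares an id with a real token (for example when the end-of-sequence token is reused for padding), genuine tokens would be masked too, so passing an explicit attention_mask is the safer option in that case.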