
Implement correct attention masking

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's competency in attention mechanisms for autoregressive Transformers, specifically the implementation of causal and padding masks, their combination in multi-head attention, and the ability to diagnose masking-related training anomalies.



Company: Applied Intuition

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

Implement correct attention masking for an autoregressive Transformer with variable-length sequences and padding. Produce code that constructs (a) a causal mask preventing attention to future tokens and (b) a padding mask that excludes padded positions, and show how to combine and apply them in multi-head attention. Explain common bugs, how they manifest in training loss or accuracy, and how to test for them.


Related Interview Questions

  • Implement and explain positional encoding - Applied Intuition (medium)

Autoregressive Transformer: Correct Attention Masking with Padding

Context: You are implementing decoder self-attention for an autoregressive Transformer where input sequences in a batch are right-padded to a common length. You must prevent attention to future tokens and exclude padded positions.

Task (illustrative sketches follow the assumptions below):

  1. Implement a causal mask that prevents attending to future tokens.
  2. Implement a padding mask that prevents attending to padded positions.
  3. Show how to combine and apply these masks in multi-head attention.
  4. Explain common bugs, how they appear in loss/accuracy, and how to test for them.

Assumptions:

  • Batch-first tensors: shape (B, T, D).
  • Right padding. An attention_mask of shape (B, T) uses 1 for real tokens and 0 for padding (or equivalently, you have pad_token_id and input_ids).
  • PyTorch is available.
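A minimal sketch of steps 1 and 2 under the assumptions above; helper names such as build_causal_mask and build_padding_mask are illustrative choices, not a required API. Both masks are boolean with True meaning "this key may be attended to", which matches the convention PyTorch's scaled_dot_product_attention uses for a boolean attn_mask:

```python
import torch


def build_causal_mask(T: int, device=None) -> torch.Tensor:
    # (T, T) boolean mask; True means "query i may attend to key j".
    # torch.tril keeps the diagonal, so each position sees itself and
    # earlier positions but never future ones.
    return torch.tril(torch.ones(T, T, dtype=torch.bool, device=device))


def build_padding_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    # attention_mask: (B, T) with 1 for real tokens and 0 for padding.
    # Returns a (B, 1, 1, T) boolean key-side mask (True = real key)
    # that broadcasts over the head and query dimensions.
    return attention_mask.bool()[:, None, None, :]
```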
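For step 3, one way to combine the two masks and apply them, sketched with an explicit scaled dot-product attention so the masking point is visible; the same combined mask also broadcasts into torch.nn.functional.scaled_dot_product_attention. Note that nn.MultiheadAttention uses the opposite convention for key_padding_mask (True means "ignore this key"), a frequent source of inverted-mask bugs:

```python
import math

import torch


def combine_masks(causal: torch.Tensor, padding: torch.Tensor) -> torch.Tensor:
    # causal: (T, T) bool, padding: (B, 1, 1, T) bool.
    # Result: (B, 1, T, T) bool, True = may attend. A key is visible only if
    # it is both non-future (causal) and a real token (padding).
    return causal[None, None, :, :] & padding


def masked_attention(q, k, v, mask):
    # q, k, v: (B, H, T, Dh); mask: (B, 1, T, T) bool, True = keep.
    # Disallowed scores are set to -inf BEFORE the softmax so they receive
    # exactly zero weight. With right padding, key position 0 is always real,
    # so no row is fully masked and the softmax cannot produce NaNs.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (B, H, T, T)
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v


# The same boolean mask broadcasts into the fused kernel, which uses the
# identical True-means-attend convention:
#   out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```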
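For step 4, a sketch of a leakage test that reuses the helpers above: if the causal mask is wrong, re-randomising future positions changes earlier outputs; if the padding mask is wrong, re-randomising padded positions changes some output row. Typical training symptoms are a teacher-forced loss that looks too good to be true when future tokens leak, and metrics that drift with batch padding when pad positions are not excluded.

```python
import torch


def test_mask_blocks_future_and_padding():
    # Reuses build_causal_mask, build_padding_mask, combine_masks and
    # masked_attention from the sketches above.
    torch.manual_seed(0)
    B, H, T, Dh = 2, 4, 6, 8
    lengths = torch.tensor([6, 4])          # second sequence has 2 pad slots
    attention_mask = (torch.arange(T)[None, :] < lengths[:, None]).long()

    q, k, v = (torch.randn(B, H, T, Dh) for _ in range(3))
    mask = combine_masks(build_causal_mask(T), build_padding_mask(attention_mask))
    out = masked_attention(q, k, v, mask)

    # Causal check: re-randomise keys/values after position t. Outputs at
    # positions <= t must be numerically unchanged.
    t = 2
    k2, v2 = k.clone(), v.clone()
    k2[:, :, t + 1:] = torch.randn(B, H, T - t - 1, Dh)
    v2[:, :, t + 1:] = torch.randn(B, H, T - t - 1, Dh)
    out2 = masked_attention(q, k2, v2, mask)
    assert torch.allclose(out[:, :, :t + 1], out2[:, :, :t + 1], atol=1e-6)

    # Padding check: re-randomise only padded key/value positions. No output
    # at any query position should change, including the pad-query rows.
    pad = (attention_mask == 0)[:, None, :, None]         # (B, 1, T, 1)
    k3 = torch.where(pad, torch.randn_like(k), k)
    v3 = torch.where(pad, torch.randn_like(v), v)
    out3 = masked_attention(q, k3, v3, mask)
    assert torch.allclose(out, out3, atol=1e-6)
```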

