
Implement Multi-Head Self-Attention

Last updated: Mar 29, 2026

Quick Overview

This question evaluates understanding of multi-head self-attention and the ability to implement transformer attention modules using learned Q/K/V projections, head-wise tensor reshaping, attention masking, and PyTorch tensor operations.

  • medium
  • Uber
  • Machine Learning
  • Machine Learning Engineer

Implement Multi-Head Self-Attention

Company: Uber

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen


Related Interview Questions

  • Evaluate Promotions for Uber Eats Users - Uber (medium)
  • Implement Streaming Clustering for Numbers - Uber
  • Build cold-start restaurant ratings - Uber (medium)
  • Implement CLIP Contrastive Loss - Uber (medium)
  • Predict driver acceptance - Uber (medium)

Implement a multi-head self-attention module in PyTorch without using torch.nn.MultiheadAttention.

Requirements:

  • Input tensor shape: (batch_size, seq_len, d_model)
  • Number of heads: num_heads, where d_model % num_heads == 0
  • Use learned linear projections to produce Q, K, and V from the same input
  • Split the projected tensors into multiple heads
  • Compute scaled dot-product attention for each head: Attention(Q, K, V) = softmax(Q K^T / sqrt(head_dim)) V
  • Support an optional attention mask so masked positions do not contribute to attention scores
  • Concatenate all heads and apply a final output projection
  • Return a tensor of shape (batch_size, seq_len, d_model)

You may assume standard PyTorch layers such as nn.Linear, torch.matmul, softmax, view, and transpose are available.

Explain any important tensor shape transformations and common implementation pitfalls.
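
A common pitfall with the mask requirement is multiplying the attention scores (or the softmax output) by the mask: masked logits become 0 and still receive weight after the softmax, or the remaining weights no longer sum to 1. The usual fix is to set masked logits to a large negative value (or -inf) before the softmax. A minimal sketch of that step, assuming a boolean mask where True marks positions that may be attended to (the helper name masked_softmax is illustrative):

```python
import torch
import torch.nn.functional as F

def masked_softmax(scores, mask=None):
    """Softmax over the last dimension, giving ~0 weight to masked positions."""
    if mask is not None:
        # Pitfall: scores * mask (or softmax(scores) * mask) still lets masked
        # positions distort the normalization. Setting the masked logits to -inf
        # before the softmax removes their contribution entirely.
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1)
```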

Solution
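
One possible implementation is sketched below. It is an independent sketch rather than a canonical answer: it assumes a boolean attention mask broadcastable to (batch, num_heads, seq_len, seq_len) where True means "may attend", and the class and helper names (MultiHeadSelfAttention, _split_heads) are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention with learned Q/K/V projections (sketch)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        # Learned projections for queries, keys, and values, plus the output projection.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def _split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, head_dim)
        batch, seq_len, _ = x.shape
        return x.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x, mask=None):
        batch, seq_len, d_model = x.shape

        # Project the same input into Q, K, V and split each into heads.
        q = self._split_heads(self.q_proj(x))
        k = self._split_heads(self.k_proj(x))
        v = self._split_heads(self.v_proj(x))

        # Scaled dot-product scores: (batch, num_heads, seq_len, seq_len).
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        if mask is not None:
            # Mask convention (assumed): True = attend, False = block.
            scores = scores.masked_fill(~mask, float("-inf"))

        attn = F.softmax(scores, dim=-1)
        context = torch.matmul(attn, v)  # (batch, num_heads, seq_len, head_dim)

        # Merge heads: transpose back, then make memory contiguous before view.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out_proj(context)
```

Two of the pitfalls the prompt asks about show up here: transpose returns a non-contiguous tensor, so the final view fails without .contiguous() (or reshape), and the scores are scaled by sqrt(head_dim) rather than sqrt(d_model). A common variant fuses the three projections into a single nn.Linear(d_model, 3 * d_model) for efficiency.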

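A quick shape check with a causal (lower-triangular) mask, using the illustrative class above; the mask shape (1, 1, seq_len, seq_len) broadcasts over batch and heads:

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 5, 64, 8
mha = MultiHeadSelfAttention(d_model, num_heads)

x = torch.randn(batch_size, seq_len, d_model)
# Causal mask: position i may attend only to positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool().view(1, 1, seq_len, seq_len)

out = mha(x, mask=causal_mask)
print(out.shape)  # torch.Size([2, 5, 64])
```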

