PracHub

Explain attention variants and their tradeoffs

Last updated: Mar 29, 2026

Quick Overview

This question evaluates understanding of Transformer attention mechanisms (scaled dot-product, multi-head, grouped-query, and multi-query attention) and measures competency in model internals, tensor shapes, computational and memory complexity, and inference deployment trade-offs.


Company: Startups.Com

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Onsite


Mar 10, 2026, 12:00 AM

You are asked to explain and reason about modern Transformer attention mechanisms.

  1. Scaled dot-product attention
  • Define the operation mathematically (including the scaling term) and explain why scaling is used.
  • Provide the typical tensor shapes for Q, K, V in a batched setting.
  2. Multi-head attention (MHA)
  • Explain how MHA differs from single-head attention, including the projection matrices, per-head computation, concatenation, and output projection.
  • Discuss compute and memory complexity with respect to sequence length L, number of heads H, and head dimension d.
  3. Grouped-query attention (GQA)
  • Define GQA and how it differs from:
    • MHA (each head has its own K/V)
    • Multi-query attention (MQA: all heads share one K/V)
  • Explain why GQA is commonly used for LLM inference.
  4. When would you choose MHA vs GQA vs MQA?
  • Discuss quality/expressiveness tradeoffs, KV-cache size, bandwidth, and practical deployment constraints.
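
To make the shapes in the question concrete, here is a minimal NumPy sketch, not a reference implementation. It computes softmax(Q Kᵀ / sqrt(d)) V per head, with Q of shape (B, H, L, d) and K, V of shape (B, G, L, d), where G is the number of K/V heads. Setting G = H gives MHA, G = 1 gives MQA, and anything in between gives GQA. The function names, the omission of projection matrices and causal masking, and the 2-byte element size in the cache estimate are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v):
    """Hypothetical helper: attention with shared K/V heads.

    q:    (B, H, L, d)  one query tensor per head
    k, v: (B, G, L, d)  G key/value heads, with H divisible by G
    G == H is MHA, G == 1 is MQA, 1 < G < H is GQA.
    """
    B, H, L, d = q.shape
    G = k.shape[1]
    assert H % G == 0, "number of query heads must be a multiple of K/V heads"

    # Each block of H // G consecutive query heads shares one K/V head.
    k = np.repeat(k, H // G, axis=1)  # (B, H, L, d)
    v = np.repeat(v, H // G, axis=1)  # (B, H, L, d)

    # Dividing by sqrt(d) keeps the logit variance roughly independent of d,
    # which stops the softmax from saturating as the head dimension grows.
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)  # (B, H, L, L)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    return weights @ v                                 # (B, H, L, d)


def kv_cache_bytes(batch, seq_len, n_kv_heads, head_dim, n_layers,
                   bytes_per_elem=2):
    # The leading factor of 2 counts both the K and the V tensors.
    return 2 * batch * seq_len * n_kv_heads * head_dim * n_layers * bytes_per_elem
```

The score tensor costs O(H · L² · d) compute and O(H · L²) memory regardless of G, so sharing K/V heads does not reduce attention FLOPs. What it shrinks is the KV cache, which scales linearly with G: for an assumed configuration of 32 layers, head dimension 128, and 4096 cached tokens, `kv_cache_bytes` drops by a factor of 4 when going from 32 K/V heads to 8.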
