PracHub

Estimate VRAM and compare model parallelism

Last updated: Mar 29, 2026

Quick Overview

This question tests GPU memory budgeting for large matrix multiplications and the trade-offs between pipeline and tensor model parallelism. It assesses memory sizing, numerical-precision effects (FP16/BF16), communication patterns, and performance reasoning (latency, throughput).



Company: Anthropic

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite




You are reasoning about GPU memory and parallelism for a transformer-like workload dominated by matrix multiplications.

Part 1: Can one matmul’s tensors fit in VRAM?

You need to compute an output activation:

  • Input activation matrix A has shape (m, k)
  • Weight matrix W has shape (k, n)
  • Output activation matrix Y has shape (m, n)

Assume dtype is FP16/BF16 unless stated otherwise.

Question: Given a GPU with V bytes of available VRAM (after runtime/fragmentation overhead), can you fit the tensors required for this operation in memory at once?

  • Consider at least: `A`, `W`, `Y`
  • Optionally discuss extra memory for workspace (e.g., GEMM algorithms), alignment, and caching.
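A straightforward way to answer Part 1 is to total element counts times bytes per element and compare against the budget. The sketch below is a minimal Python check under stated assumptions: 2 bytes per element for FP16/BF16, and an illustrative 5% workspace margin (real GEMM workspace depends on the library and the algorithm it selects).

```python
def matmul_fits_in_vram(m: int, k: int, n: int, v_bytes: int,
                        bytes_per_elem: int = 2,      # FP16/BF16 = 2 bytes
                        workspace_frac: float = 0.05) -> bool:
    """Check whether A (m,k), W (k,n), and Y (m,n) fit in v_bytes of VRAM.

    workspace_frac is an assumed margin for GEMM workspace/alignment,
    not a measured figure.
    """
    tensor_bytes = bytes_per_elem * (m * k + k * n + m * n)
    total = tensor_bytes * (1 + workspace_frac)
    return total <= v_bytes

# Example: an 8192x8192x8192 FP16 matmul needs
# 2 * 3 * 8192^2 ≈ 0.4 GB, which fits easily in an 80 GiB budget.
print(matmul_fits_in_vram(8192, 8192, 8192, 80 * 2**30))   # → True
```

The same inequality in closed form: the tensors fit iff `bytes_per_elem * (m*k + k*n + m*n)` plus workspace is at most `V`.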

Part 2: Two GPUs — pipeline parallelism vs tensor parallelism

You have 2 GPUs and want to run end-to-end inference or training.

  1. Explain how you would split the model using:
    • Pipeline Parallelism (PP) across 2 GPUs
    • Tensor Parallelism (TP) across 2 GPUs
  2. For each approach, discuss:
    • End-to-end latency and throughput (including pipeline “bubble” effects for PP)
    • Per-GPU memory usage (what is replicated vs sharded)
    • Communication patterns and costs
    • Key tradeoffs and when you would choose PP vs TP
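The two splitting strategies can be illustrated with a toy NumPy sketch in which array shards stand in for GPUs. The names `tp_matmul` and `pp_forward` are hypothetical, and the `concatenate` stands in for the all-gather a real tensor-parallel implementation performs over the interconnect.

```python
import numpy as np

def tp_matmul(A, W, num_shards=2):
    """Tensor parallelism (column-parallel): split W column-wise across
    num_shards 'GPUs', compute partial outputs, then gather."""
    shards = np.split(W, num_shards, axis=1)   # each shard: (k, n/num_shards)
    partials = [A @ s for s in shards]         # each GPU computes (m, n/num_shards)
    return np.concatenate(partials, axis=1)    # all-gather -> full (m, n)

def pp_forward(A, W1, W2):
    """Pipeline parallelism: layer 1 on 'GPU 0', layer 2 on 'GPU 1'.
    Only the stage-boundary activation h crosses the interconnect."""
    h = A @ W1          # stage 0
    return h @ W2       # stage 1
```

The memory contrast falls out of the shapes: under TP each device holds a `(k, n/2)` shard of every weight, halving weight memory per layer but requiring a collective per matmul; under PP each device holds only its stage's full layers, with a single point-to-point activation transfer per stage boundary, at the cost of pipeline bubbles when microbatching is shallow.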

