Estimate VRAM and compare model parallelism

Last updated: Jun 21, 2026

Quick Overview

This question evaluates understanding of GPU memory budgeting for large matrix multiplications and the comparative trade-offs between pipeline and tensor model parallelism, assessing competencies in memory sizing, numerical-precision effects (FP16/BF16), communication patterns, and performance metrics.

Company: Anthropic

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

You are a performance engineer reasoning about GPU memory and parallelism for a transformer-like workload whose runtime is dominated by large matrix multiplications (GEMMs). You will first estimate whether a single matmul's tensors fit in one GPU's memory, then compare two ways to split the model across **2 GPUs**.

### Constraints & Assumptions

- Default dtype is **FP16/BF16** (2 bytes per element) unless a part states otherwise; FP32 is 4 bytes.
- A single GPU exposes `V` bytes of *usable* VRAM, i.e. after the CUDA context, driver reserves, and allocator fragmentation have been subtracted.
- The matmul under study is $Y = A \cdot W$ with $A \in \mathbb{R}^{m \times k}$, $W \in \mathbb{R}^{k \times n}$, $Y \in \mathbb{R}^{m \times n}$.
- For Part 2 you have exactly **2 GPUs** connected by some interconnect (e.g. NVLink or PCIe); treat the per-GPU memory budget and the interconnect bandwidth/latency as the levers you reason about.

### Clarifying Questions to Ask

- Is this **inference or training**? (Training adds gradients, optimizer state, and retained activations for backprop.)
- What is the **dtype**, and are accumulations done in FP32 even when storage is FP16/BF16?
- How much VRAM should I **reserve as headroom** for GEMM workspace, fragmentation, and other framework buffers?
- For Part 2, what is the **interconnect** (NVLink vs PCIe) and its bandwidth/latency, and what are the batch/sequence dimensions?
- Is the bottleneck **fitting the model (memory-bound)** or **per-step latency/throughput (compute/comm-bound)**?

### Part 1: Can one matmul's tensors fit in VRAM?

You need to compute the output activation $Y = A \cdot W$ on a single GPU with `V` bytes of usable VRAM.

Derive the memory required to hold the operands and result at once, and state a clear, defensible fit/no-fit decision rule. Account for at least `A`, `W`, and `Y`, and discuss the additional ("hidden") memory a real kernel consumes.

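
As a starting point for the Part 1 estimate, the three tensors occupy $b\,(mk + kn + mn)$ bytes, where $b$ is the bytes per element. The sketch below turns that into a fit check against `V`. The function names and the 20% headroom factor are illustrative assumptions, not values given by the question; a real budget would measure workspace and fragmentation rather than guess them.

```python
def matmul_bytes(m, k, n, bytes_per_elem=2):
    """Bytes needed to hold A (m x k), W (k x n) and Y (m x n) at once."""
    return bytes_per_elem * (m * k + k * n + m * n)


def fits_in_vram(m, k, n, usable_vram_bytes, bytes_per_elem=2, headroom=0.2):
    """Fit rule: tensor bytes, inflated by a headroom factor, must not exceed V.

    `headroom` is an assumed allowance for GEMM workspace, fragmentation,
    and framework buffers; 0.2 is a placeholder, not a measured value.
    """
    required = matmul_bytes(m, k, n, bytes_per_elem)
    return required * (1 + headroom) <= usable_vram_bytes


MIB = 2**20
GIB = 2**30

# FP16 example: m = k = n = 8192 needs 3 * 8192^2 * 2 bytes = 384 MiB.
fp16_bytes = matmul_bytes(8192, 8192, 8192)
print(fp16_bytes // MIB)                               # 384
print(fits_in_vram(8192, 8192, 8192, 16 * GIB))        # True
print(fits_in_vram(8192, 8192, 8192, 400 * MIB))       # False once headroom is applied
```

Switching `bytes_per_elem` to 4 doubles the requirement for FP32 storage. For training, the same rule would need extra terms for gradients, optimizer state, and retained activations.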
```hint Where to start
Think about what you need to store simultaneously: the input, the weight matrix, and the output. For each tensor, what two quantities determine how many bytes it occupies?
```

```hint Don't forget the overhead
The theoretical minimum storage is a necessary but not sufficient condition. What else does a GEMM kernel need beyond the named tensors? How does the answer change if you are training rather than just running inference?
```

#### What This Part Should Cover

- A correct byte formula for the three tensors and an explicit fit inequality against `V`.
- Awareness that dtype changes the bytes per element $b$ (FP16/BF16 = 2, FP32 = 4) and that accumulation may still be FP32.
- Naming concrete *hidden* consumers (GEMM workspace, fragmentation, bias/residual, KV cache for inference, retained activations / gradients / optimizer state for training) and folding them into a headroom factor.
- Distinguishing the inference vs training memory profiles rather than giving one number.

### Part 2: Two GPUs — pipeline parallelism vs tensor parallelism

You want to run the model end-to-end across **2 GPUs**. Explain how you would split the model under **Pipeline Parallelism (PP)** and under **Tensor Parallelism (TP)**, and compare them.

For each approach, address: (a) how weights and activations are partitioned, (b) per-GPU memory usage (what is replicated vs sharded), (c) the communication pattern and its cost, and (d) the effect on end-to-end latency and throughput, including the **pipeline bubble** for PP. Conclude with when you would pick PP vs TP.

```hint Think about the split axis
PP and TP answer different questions about *where* to draw the boundary. Ask yourself: is each GPU responsible for certain *layers* of the model, or for certain *parts* of each layer? How does that choice determine what each GPU must store and what it must communicate?
```

```hint Reasoning about the PP bubble
When only one GPU is active at a time, that is dead time you cannot recover. Think about what drives how much of the total wall-clock is wasted that way, and what you could do to the incoming workload to reduce the waste.
```

```hint Thinking about communication cost
Consider how many times the two GPUs must exchange data per input and what the *volume* of that exchange is. Does it happen once at a coarse boundary, or once per layer? How does the answer differ between PP and TP, and why does interconnect speed matter more for one than the other?
```

#### Clarifying Questions for this Part

- Will requests be **microbatched** (which lets PP hide its bubble) or served one at a time (which exposes it)?
- Is the goal to minimize **single-request latency** or to maximize **aggregate throughput**?
- Is the **interconnect** fast enough (NVLink) to make per-layer TP collectives cheap, or is it PCIe?

#### What This Part Should Cover

- A correct, concrete description of how each strategy partitions the model, covering the split axis for both PP and TP.
- Per-GPU memory accounting: what is sharded vs replicated under each scheme.
- The communication pattern for each strategy: frequency, volume, and operation type (point-to-point vs collective), and why that matters for interconnect sensitivity.
- The latency/throughput story for PP: what causes idle time, how that idle fraction depends on batch structure, and how microbatching helps.
- A defensible decision rule: when to prefer PP, when to prefer TP, and when to combine them.

### Follow-up Questions

- For the column-parallel vs row-parallel TP split of $Y = A \cdot W$, which dimension does each shard ($n$ vs $k$) and which one requires an all-reduce vs an all-gather to reconstruct the output?
- In a standard transformer block (attention + MLP), where exactly do TP's collectives land, and how many per block per forward pass?
- If you had **8 GPUs** instead of 2, how would you combine PP, TP, and data parallelism, and what would set the TP degree vs the PP degree?
- How do training-only memory consumers (gradients, optimizer state, retained activations) change your Part 1 fit estimate, and what techniques (activation checkpointing, ZeRO/FSDP sharding) would you reach for first?
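
Two pieces of the Part 2 reasoning can be checked with a few lines of arithmetic. The sketch below is a single-process simulation in pure Python, with hypothetical function names and no real GPUs or collectives. It first splits $Y = A \cdot W$ along $n$ (column-parallel) and along $k$ (row-parallel) to show which reconstruction each split needs. It then evaluates the pipeline bubble fraction $(p-1)/(m_b + p - 1)$ for $p$ stages and $m_b$ microbatches, which assumes a simple fill-and-drain schedule with equal stage times and ignores communication.

```python
from fractions import Fraction


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_parallel(a, w, gpus=2):
    """Shard W along n; each 'GPU' produces a column slice of Y.

    Reconstruction is a concatenation, standing in for an all-gather.
    Assumes n is divisible by `gpus`.
    """
    step = len(w[0]) // gpus
    partials = [
        matmul(a, [row[g * step:(g + 1) * step] for row in w]) for g in range(gpus)
    ]
    return [sum((p[r] for p in partials), []) for r in range(len(a))]


def row_parallel(a, w, gpus=2):
    """Shard W along k (and A along its columns); each 'GPU' produces a
    full-shape partial sum of Y.

    Reconstruction is an elementwise sum, standing in for an all-reduce.
    Assumes k is divisible by `gpus`.
    """
    step = len(w) // gpus
    partials = [
        matmul([row[g * step:(g + 1) * step] for row in a], w[g * step:(g + 1) * step])
        for g in range(gpus)
    ]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def pp_bubble_fraction(stages, microbatches):
    """Idle fraction of a fill-and-drain pipeline with equal stage times."""
    return Fraction(stages - 1, microbatches + stages - 1)


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 2, 3, 0]]
reference = matmul(a, w)
print(column_parallel(a, w) == reference)   # True
print(row_parallel(a, w) == reference)      # True
print(pp_bubble_fraction(2, 1))             # 1/2: one GPU idle half the time
print(pp_bubble_fraction(2, 8))             # 1/9: microbatching shrinks the bubble
```

With 2 stages and a single request, half the wall-clock is bubble. Eight microbatches bring that down to one ninth, which is why the microbatching question matters when weighing PP against TP.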

Nov 19, 2025, 12:00 AM
