Estimate VRAM and compare model parallelism
Company: Anthropic
Role: Software Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
You are a performance engineer reasoning about GPU memory and parallelism for a transformer-like workload whose runtime is dominated by large matrix multiplications (GEMMs). You will first estimate whether a single matmul's tensors fit in one GPU's memory, then compare two ways to split the model across **2 GPUs**.
### Constraints & Assumptions
- Default dtype is **FP16/BF16** (2 bytes per element) unless a part states otherwise; FP32 is 4 bytes.
- A single GPU exposes `V` bytes of *usable* VRAM, i.e. after the CUDA context, driver reserves, and allocator fragmentation have been subtracted.
- The matmul under study is $Y = A \cdot W$ with $A \in \mathbb{R}^{m \times k}$, $W \in \mathbb{R}^{k \times n}$, $Y \in \mathbb{R}^{m \times n}$.
- For Part 2 you have exactly **2 GPUs** connected by some interconnect (e.g. NVLink or PCIe); treat the per-GPU memory budget and the interconnect bandwidth/latency as the levers you reason about.
### Clarifying Questions to Ask
- Is this **inference or training**? (Training adds gradients, optimizer state, and retained activations for backprop.)
- What is the **dtype**, and are accumulations done in FP32 even when storage is FP16/BF16?
- How much VRAM should I **reserve as headroom** for GEMM workspace, fragmentation, and other framework buffers?
- For Part 2, what is the **interconnect** (NVLink vs PCIe) and its bandwidth/latency, and what are the batch/sequence dimensions?
- Is the bottleneck **fitting the model (memory-capacity-bound)** or **per-step latency/throughput (compute/comm-bound)**?
### Part 1: Can one matmul's tensors fit in VRAM?
You need to compute the output activation $Y = A \cdot W$ on a single GPU with `V` bytes of usable VRAM.
Derive the memory required to hold the operands and result at once, and state a clear, defensible fit/no-fit decision rule. Account for at least `A`, `W`, and `Y`, and discuss the additional ("hidden") memory a real kernel consumes.
```hint Where to start
Think about what you need to store simultaneously: the input, the weight matrix, and the output. For each tensor, what two quantities determine how many bytes it occupies?
```
```hint Don't forget the overhead
The theoretical minimum storage is a necessary but not sufficient condition. What else does a GEMM kernel need beyond the named tensors? How does the answer change if you are training rather than just running inference?
```
#### What This Part Should Cover
- A correct byte formula for the three tensors and an explicit fit inequality against `V`.
- Awareness that dtype changes the bytes-per-element factor $b$ (FP16/BF16 = 2, FP32 = 4) and that accumulation may still be FP32.
- Naming concrete *hidden* consumers (GEMM workspace, fragmentation, bias/residual, KV cache for inference, retained activations / gradients / optimizer state for training) and folding them into a headroom factor.
- Distinguishing the inference vs training memory profiles rather than giving one number.
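The byte formula and fit rule above can be sketched as follows. This is a minimal illustration, not a definitive sizing tool: the function names and the `headroom=0.2` default are assumptions made for this example, and the headroom fraction in particular should be replaced by a measured value for the real framework and kernel.

```python
# Bytes per element for the dtypes named in the constraints.
BYTES_PER_ELEMENT = {"fp16": 2, "bf16": 2, "fp32": 4}


def matmul_bytes(m: int, k: int, n: int, dtype: str = "fp16") -> int:
    """Minimum bytes to hold A (m x k), W (k x n), and Y (m x n) at once."""
    b = BYTES_PER_ELEMENT[dtype]
    return b * (m * k + k * n + m * n)


def fits(m: int, k: int, n: int, usable_vram_bytes: int,
         dtype: str = "fp16", headroom: float = 0.2) -> bool:
    """Fit rule: the named tensors must fit inside (1 - headroom) * V.

    `headroom` is an assumed fraction of V set aside for hidden consumers
    (GEMM workspace, fragmentation, other framework buffers).
    """
    return matmul_bytes(m, k, n, dtype) <= (1.0 - headroom) * usable_vram_bytes


if __name__ == "__main__":
    need = matmul_bytes(4096, 8192, 8192, "fp16")
    print(f"named tensors: {need / 2**20:.0f} MiB")
    print("fits in 24 GiB:", fits(4096, 8192, 8192, 24 * 2**30))
```

This covers only the inference-style lower bound. For training, gradients, optimizer state, and retained activations would have to be added to the left-hand side before comparing against `V`.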
### Part 2: Two GPUs — pipeline parallelism vs tensor parallelism
You want to run the model end-to-end across **2 GPUs**. Explain how you would split the model under **Pipeline Parallelism (PP)** and under **Tensor Parallelism (TP)**, and compare them.
For each approach, address: (a) how weights and activations are partitioned, (b) per-GPU memory usage (what is replicated vs sharded), (c) the communication pattern and its cost, and (d) the effect on end-to-end latency and throughput — including the **pipeline bubble** for PP. Conclude with when you would pick PP vs TP.
```hint Think about the split axis
PP and TP answer different questions about *where* to draw the boundary. Ask yourself: is each GPU responsible for certain *layers* of the model, or for certain *parts* of each layer? How does that choice determine what each GPU must store and what it must communicate?
```
```hint Reasoning about the PP bubble
Whenever one GPU sits idle waiting on the other, that is dead time you cannot recover. Think about what drives how much of the total wall-clock is wasted that way, and what you could do to the incoming workload to reduce the waste.
```
```hint Thinking about communication cost
Consider how many times the two GPUs must exchange data per input and what the *volume* of that exchange is. Does it happen once at a coarse boundary, or once per layer? How does the answer differ between PP and TP, and why does interconnect speed matter more for one than the other?
```
#### Clarifying Questions for this Part
- Will requests be **microbatched** (which lets PP shrink its bubble) or served one at a time (which exposes it in full)?
- Is the goal to minimize **single-request latency** or to maximize **aggregate throughput**?
- Is the **interconnect** fast enough (NVLink) to make per-layer TP collectives cheap, or is it PCIe?
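To see why the microbatching question matters, the idle fraction of a simple pipeline can be estimated with the commonly used idealized model below. It assumes every stage takes the same time per microbatch and that communication is free, neither of which holds exactly in practice; the function name is hypothetical.

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of an idealized pipeline schedule.

    Assumes equal per-stage compute time and zero communication cost.
    The schedule takes (num_microbatches + num_stages - 1) time slots,
    of which (num_stages - 1) are fill/drain slots where a stage is idle.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    for microbatches in (1, 2, 4, 8, 32):
        frac = pipeline_bubble_fraction(2, microbatches)
        print(f"2 stages, {microbatches:>2} microbatches: {frac:.1%} idle")
```

With 2 stages and a single request, half of the schedule is idle under this model. Splitting the same work into more microbatches drives the fraction toward zero but never removes it.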
#### What This Part Should Cover
- A correct, concrete description of how each strategy partitions the model, covering the split axis for both PP and TP.
- Per-GPU memory accounting: what is sharded vs replicated under each scheme.
- The communication pattern for each strategy: frequency, volume, and operation type (point-to-point vs collective), and why that matters for interconnect sensitivity.
- The latency/throughput story for PP: what causes idle time, how that idle fraction depends on batch structure, and how microbatching helps.
- A defensible decision rule: when to prefer PP, when to prefer TP, and when to combine them.
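The difference in communication frequency and volume can be made concrete with a rough forward-pass estimate. This sketch rests on several assumptions that are not stated in the question: PP sends one activation tensor of shape tokens x hidden across each stage boundary, TP performs two all-reduces per transformer block (a Megatron-style layout), and each all-reduce uses a ring algorithm in which every GPU sends `2 * (t - 1) / t` times the payload. All function and parameter names are illustrative.

```python
def pp_forward_comm_bytes(tokens: int, hidden: int,
                          bytes_per_el: int = 2, num_stages: int = 2) -> int:
    """Bytes sent point-to-point per forward pass under pipeline parallelism.

    Assumes one activation handoff (tokens x hidden) per stage boundary.
    """
    return (num_stages - 1) * tokens * hidden * bytes_per_el


def tp_forward_comm_bytes(tokens: int, hidden: int, num_layers: int,
                          bytes_per_el: int = 2, tp_degree: int = 2,
                          allreduces_per_layer: int = 2) -> float:
    """Bytes each GPU sends per forward pass under tensor parallelism.

    Assumes `allreduces_per_layer` ring all-reduces per block, each over a
    tokens x hidden activation.
    """
    payload = tokens * hidden * bytes_per_el
    ring_factor = 2 * (tp_degree - 1) / tp_degree  # per-GPU send volume
    return num_layers * allreduces_per_layer * ring_factor * payload


if __name__ == "__main__":
    pp = pp_forward_comm_bytes(tokens=2048, hidden=4096)
    tp = tp_forward_comm_bytes(tokens=2048, hidden=4096, num_layers=32)
    print(f"PP: {pp / 2**20:.0f} MiB, TP: {tp / 2**20:.0f} MiB per GPU")
```

Under these assumptions TP moves far more data than PP and does so on the critical path of every layer, which is why TP is usually the more interconnect-sensitive of the two. The estimate ignores latency per message and any overlap of communication with compute.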
### Follow-up Questions
- For the column-parallel vs row-parallel TP split of $Y = A \cdot W$, which dimension ($n$ or $k$) does each one shard, and which one needs an all-reduce rather than an all-gather to reconstruct the output?
- In a standard transformer block (attention + MLP), where exactly do TP's collectives land, and how many per block per forward pass?
- If you had **8 GPUs** instead of 2, how would you combine PP, TP, and data parallelism, and what would set the TP degree vs the PP degree?
- How do training-only memory consumers (gradients, optimizer state, retained activations) change your Part 1 fit estimate, and what techniques (activation checkpointing, ZeRO/FSDP sharding) would you reach for first?
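As a starting point for the first follow-up, the two TP splits can be checked on a toy example. The sketch below simulates both shards in one process with plain Python lists; the concatenation and the elementwise sum stand in for the all-gather and all-reduce a real multi-GPU run would perform. Function names are hypothetical, and the code assumes $n$ and $k$ divide evenly by the shard count.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def column_parallel(a, w, shards=2):
    """Shard W along n (its columns); each shard computes a column slice of Y."""
    step = len(w[0]) // shards
    parts = [matmul(a, [row[i * step:(i + 1) * step] for row in w])
             for i in range(shards)]
    # Stand-in for all-gather: concatenate the column slices row by row.
    return [sum((p[r] for p in parts), []) for r in range(len(a))]


def row_parallel(a, w, shards=2):
    """Shard W along k (its rows) and A along k (its columns).

    Each shard computes a full-size m x n partial sum of Y.
    """
    step = len(w) // shards
    parts = [matmul([row[i * step:(i + 1) * step] for row in a],
                    w[i * step:(i + 1) * step])
             for i in range(shards)]
    # Stand-in for all-reduce: sum the partial results elementwise.
    return [[sum(p[r][c] for p in parts) for c in range(len(w[0]))]
            for r in range(len(a))]


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8]]          # m=2, k=4
    W = [[1, 0], [0, 1], [2, 1], [1, 3]]      # k=4, n=2
    assert column_parallel(A, W) == matmul(A, W) == row_parallel(A, W)
    print(matmul(A, W))
```

Note that the column split leaves each shard holding a disjoint slice of the output, while the row split leaves each shard holding a same-shaped partial sum; that difference is what determines which collective is needed.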
Quick Answer: This question evaluates understanding of GPU memory budgeting for large matrix multiplications and the comparative trade-offs between pipeline and tensor model parallelism, assessing competencies in memory sizing, numerical-precision effects (FP16/BF16), communication patterns, and performance metrics.