Estimate VRAM and compare model parallelism

Q: Estimate VRAM and compare model parallelism

This question evaluates understanding of GPU memory budgeting for large matrix multiplications and the comparative trade-offs between pipeline and tensor model parallelism, assessing competencies in memory sizing, numerical-precision effects (FP16/BF16), communication patterns, and performance metrics.


Question


You are reasoning about GPU memory and parallelism for a transformer-like workload dominated by matrix multiplications.

Part 1: Can one matmul’s tensors fit in VRAM?

You need to compute an output activation:

Input activation matrix A has shape (m, k)
Weight matrix W has shape (k, n)
Output activation matrix Y has shape (m, n)

Assume dtype is FP16/BF16 unless stated otherwise.

Question: Given a GPU with V bytes of available VRAM (after runtime/fragmentation overhead), can you fit the tensors required for this operation in memory at once?

Consider at least: A, W, Y
Optionally discuss extra memory for workspace (e.g., GEMM algorithms), alignment, and caching.
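
As a starting point for Part 1, the tensor footprint can be sketched as (m*k + k*n + m*n) elements times the element size, which is 2 bytes for FP16/BF16. The helper names below (`matmul_bytes`, `fits_in_vram`) and the `workspace_bytes` parameter are hypothetical, introduced only for illustration. The sketch ignores alignment padding and allocator caching, so treat the result as a lower bound rather than an exact figure.

```python
def matmul_bytes(m: int, k: int, n: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold A (m, k), W (k, n), and Y (m, n) at once.

    bytes_per_elem defaults to 2, the element size of FP16 and BF16.
    """
    elements = m * k + k * n + m * n
    return elements * bytes_per_elem


def fits_in_vram(
    m: int,
    k: int,
    n: int,
    vram_bytes: int,
    bytes_per_elem: int = 2,
    workspace_bytes: int = 0,
) -> bool:
    """True if A, W, Y plus an assumed workspace fit in vram_bytes.

    workspace_bytes is a caller-supplied guess for GEMM scratch space;
    it is an assumption, not something reported by any library here.
    """
    needed = matmul_bytes(m, k, n, bytes_per_elem) + workspace_bytes
    return needed <= vram_bytes


if __name__ == "__main__":
    # Example shapes chosen for illustration only.
    needed = matmul_bytes(4096, 8192, 8192)
    print(needed, "bytes =", needed / 2**20, "MiB")
    print(fits_in_vram(4096, 8192, 8192, vram_bytes=16 * 2**30))
```

For m = 4096 and k = n = 8192, this gives 268,435,456 bytes (256 MiB) in FP16/BF16. Switching to FP32 doubles it via `bytes_per_elem=4`.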

Part 2: Two GPUs — pipeline parallelism vs tensor parallelism

You have 2 GPUs and want to run end-to-end inference or training.

Explain how you would split the model using:
- Pipeline Parallelism (PP) across 2 GPUs
- Tensor Parallelism (TP) across 2 GPUs
For each approach, discuss:
- End-to-end latency and throughput (including pipeline “bubble” effects for PP)
- Per-GPU memory usage (what is replicated vs sharded)
- Communication patterns and costs
- Key tradeoffs and when you would choose PP vs TP
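
The discussion points above can be made concrete with a rough back-of-the-envelope model. The sketch below rests on several assumptions that are not part of the question: a GPipe-style schedule with equal-cost stages (idle fraction (p - 1) / (m + p - 1) for p stages and m microbatches), a Megatron-style TP layout with two all-reduces per transformer layer in the forward pass, and a ring all-reduce in which each GPU sends 2 * (t - 1) / t times the payload. All function and parameter names are hypothetical.

```python
def pp_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style pipeline with equal-cost stages."""
    return (stages - 1) / (microbatches + stages - 1)


def pp_boundary_bytes(tokens: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Activation bytes sent point-to-point across one stage boundary
    for a forward pass over `tokens` tokens of width `hidden`."""
    return tokens * hidden * bytes_per_elem


def tp_allreduce_bytes_per_layer(
    tokens: int,
    hidden: int,
    tp: int = 2,
    bytes_per_elem: int = 2,
    allreduces_per_layer: int = 2,  # assumption: Megatron-style forward pass
) -> int:
    """Bytes each GPU sends per layer in the forward pass,
    assuming a ring all-reduce over `tp` GPUs."""
    payload = tokens * hidden * bytes_per_elem
    return allreduces_per_layer * 2 * (tp - 1) * payload // tp


def weight_bytes_per_gpu(total_weight_bytes: int, ways: int = 2) -> int:
    """Per-GPU weight bytes if weights split evenly `ways` ways.

    Applies to PP (layers split by stage) and, as an idealization, to TP
    (every weight matrix sharded). Replicated parameters are ignored.
    """
    return total_weight_bytes // ways


if __name__ == "__main__":
    # Illustrative sizes only.
    print("bubble, 1 microbatch :", pp_bubble_fraction(2, 1))
    print("bubble, 8 microbatches:", pp_bubble_fraction(2, 8))
    print("PP boundary bytes     :", pp_boundary_bytes(2048, 4096))
    print("TP bytes per layer    :", tp_allreduce_bytes_per_layer(2048, 4096))
```

Under these assumptions, PP with 2 stages and a single microbatch leaves each GPU idle half the time, and more microbatches shrink that bubble. TP keeps both GPUs busy on every layer but pays collective communication per layer rather than once per stage boundary, which is why it tends to depend more on a fast interconnect.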
