LLMs: Hardware, Memory, and Performance
Quick Overview
This guide examines LLM hardware, memory, and performance from a systems perspective, covering model size versus actual memory usage, training versus inference memory costs, precision and throughput trade-offs, KV cache and activation scaling, and optimization techniques such as gradient checkpointing and LoRA.

Understanding Large Language Models Through a Systems Lens
A Practical Learning Resource for Candidates
Large Language Models (LLMs) are often discussed in terms of parameter counts—7B, 13B, 65B—but this framing hides what actually determines whether a model can be trained, deployed, or scaled in practice. This post is designed to help candidates move beyond surface-level knowledge and develop systems intuition: the kind of understanding that transfers across models, hardware, and even adjacent domains like computer vision or recommendation systems.
Rather than listing facts, we focus on how things interact—model size, memory, precision, and throughput—and why those interactions matter.
Model Size Is Not the Whole Story
When people say “this is a 7B model,” they usually mean the number of parameters. That number tells you how expressive the model is, but it does not tell you how expensive it is to run.
At minimum, the parameters must be stored in GPU memory. A 7B model in FP16 already requires around 14 GB just for weights. But real workloads include much more: temporary activations, attention caches, framework buffers, and sometimes optimizer state. As a result, actual memory usage often ends up several times the theoretical size.
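A quick back-of-the-envelope sketch makes this concrete; the parameter count and byte sizes below are illustrative assumptions, not measurements of any particular model or framework:

```python
# Back-of-the-envelope weight memory: parameters * bytes per element.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate the memory needed just to hold the weights, in GB."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 1e9

for dtype in ("fp32", "fp16", "int8"):
    print(f"7B model in {dtype}: ~{weight_memory_gb(7e9, dtype):.0f} GB for weights alone")
# fp32 -> ~28 GB, fp16 -> ~14 GB, int8 -> ~7 GB
# Activations, caches, and framework buffers all come on top of this.
```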
The key learning here is that feasibility is not binary. A model may technically fit on a GPU but still be impractical due to memory fragmentation, latency, or poor utilization.
Why Training Is Much Harder Than Inference
Inference feels simple: load the model, pass tokens through it, generate outputs. Training is fundamentally different because it must remember information about the forward pass in order to compute gradients.
During training, memory is consumed by:
- The model weights themselves
- Gradients for every parameter
- Optimizer state (often 2–4× the size of parameters)
- Activations saved for backpropagation
This is why training memory can reach 6–8× the raw model size, even before accounting for batch size or sequence length. Techniques like gradient checkpointing and LoRA reduce this cost, but they don’t eliminate it.
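To make the 6–8× figure concrete, here is a minimal back-of-the-envelope sketch for full fine-tuning with an Adam-style optimizer in mixed precision. The per-parameter byte counts are assumptions (common accounting conventions, not measurements), and techniques like LoRA or gradient checkpointing would shrink the gradient/optimizer and activation terms respectively:

```python
# Rough accounting for full fine-tuning with Adam in mixed precision.
def training_memory_gb(num_params: float,
                       weight_bytes: int = 2,      # BF16/FP16 weights
                       grad_bytes: int = 2,        # gradients in the same precision
                       optimizer_bytes: int = 12,  # FP32 master copy + Adam's two moments (assumed)
                       activation_gb: float = 0.0) -> float:
    """Estimate training memory in GB, excluding framework overhead and fragmentation."""
    static = num_params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9
    return static + activation_gb

params = 7e9
print(f"BF16 weights alone:          ~{params * 2 / 1e9:.0f} GB")
print(f"Weights + grads + optimizer: ~{training_memory_gb(params):.0f} GB")
# ~112 GB, i.e. ~8x the 14 GB of raw weights, before a single activation is stored.
```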
This idea generalizes well: whenever a system must learn, it pays a much higher memory and compute price than when it merely executes.
Memory Is a First-Class Design Constraint
Many candidates treat GPU memory as a fixed number—“this card has 24 GB.” In practice, memory behaves more like a budget that must be actively managed.
Small changes can have large effects:
- Doubling sequence length doubles KV cache usage
- Increasing batch size linearly increases activation memory
- Switching precision (FP32 → BF16) can cut memory in half
Understanding these tradeoffs is often more valuable than memorizing any specific GPU spec.
A useful mental model is to think of memory in layers: persistent (weights), semi-persistent (optimizer state), and ephemeral (activations). Each layer has different optimization strategies.
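To make the KV cache bullet above concrete, here is a minimal sketch; the layer, head, and dimension values are illustrative assumptions (roughly in line with a 7B-class decoder), not exact specs for any particular model:

```python
def kv_cache_gb(batch: int, seq_len: int, n_layers: int = 32,
                n_heads: int = 32, head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Size of the key/value cache for a decoder-only model, in GB."""
    # 2x for keys and values; one cached vector per layer, head, and token.
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

print(f"batch=1, 2k tokens: ~{kv_cache_gb(1, 2048):.2f} GB")
print(f"batch=1, 4k tokens: ~{kv_cache_gb(1, 4096):.2f} GB  # doubling sequence length doubles the cache")
print(f"batch=8, 4k tokens: ~{kv_cache_gb(8, 4096):.2f} GB  # batch size scales it linearly as well")
```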
Precision Is a Systems Choice, Not a Detail
Precision formats—FP32, FP16, BF16, INT8—are often discussed as numerical details, but they are really systems-level decisions.
Lower precision reduces memory pressure and increases throughput, but it can also introduce instability or accuracy loss. BF16 exists precisely because FP16’s limited exponent range caused training failures in large models.
What matters is not which precision is “best,” but why a particular precision is chosen given hardware, workload, and risk tolerance. This reasoning applies equally in other domains, such as training vision transformers or large embedding models.
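A tiny experiment shows why the exponent range matters (assuming PyTorch is available; CPU is enough):

```python
import torch

# FP16 overflows to inf past ~65,504; this is exactly how large activations
# and unscaled losses blew up FP16 training runs in practice.
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf  (out of FP16's dynamic range)
print(x.to(torch.bfloat16))  # ~70144 (coarser mantissa, but the magnitude survives)

# BF16 keeps FP32's 8-bit exponent and trades mantissa precision for dynamic range.
```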
Measuring Performance: FLOPs vs Throughput
FLOPs measure how much computation a model theoretically requires. Throughput measures how much useful work is actually done per second.
In real systems, throughput is often the better metric because it captures:
- Memory bandwidth limits
- Kernel launch overhead
- CPU–GPU synchronization
- Inefficiencies in parallelization
A GPU can advertise enormous FLOPs, yet deliver disappointing throughput if memory access becomes the bottleneck. This distinction is crucial when evaluating optimizations or comparing hardware.
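One way to internalize the difference is to measure achieved throughput yourself and compare it with the datasheet peak. Below is a minimal sketch with PyTorch that times a plain matrix multiply; the matrix size and iteration count are arbitrary choices, and the peak number to compare against comes from your own GPU's spec sheet:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
a = torch.randn(4096, 4096, device=device, dtype=dtype)
b = torch.randn(4096, 4096, device=device, dtype=dtype)

# Warm up, then time a batch of matmuls.
for _ in range(3):
    a @ b
if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
iters = 50
for _ in range(iters):
    a @ b
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

flops_per_matmul = 2 * 4096 ** 3  # multiply-adds for an N x N matmul
achieved = iters * flops_per_matmul / elapsed / 1e12
print(f"Achieved: {achieved:.1f} TFLOP/s")
# Compare against the advertised peak for your GPU; the gap is what memory
# bandwidth, kernel overhead, and synchronization cost you.
```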
Tools Matter Because They Shape Thinking
System-level understanding improves dramatically once you learn to observe what’s happening:
- nvidia-smi shows memory pressure and utilization in real time
- Profilers reveal where time is actually spent
- Topology tools explain why multi-GPU scaling sometimes fails
The lesson here isn’t about specific commands—it’s about developing the habit of debugging with evidence, not assumptions.
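The same habit extends into code. PyTorch, for instance, exposes allocator counters that complement nvidia-smi; a minimal sketch, assuming a CUDA-enabled build:

```python
import torch

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(8192, 8192, device="cuda")  # allocate something
    y = x @ x                                   # and do some work
    print(f"Currently allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"Peak allocated:      {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
    # nvidia-smi typically reports more than this: it also sees the CUDA
    # context, the allocator's cached blocks, and other processes on the GPU.
```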
How This Knowledge Transfers Beyond LLMs
Everything discussed here applies to other large-scale systems:
- Vision models also suffer from activation explosion
- Recommendation systems struggle with embedding memory
- Reinforcement learning workloads face similar optimizer overhead
Once you understand how memory, compute, and precision interact, new architectures feel less mysterious. You start asking better questions, faster.
Final Takeaway
Strong candidates don’t just know what a model is. They understand:
- What limits it
- What scales it
- What breaks it
This systems mindset turns LLM knowledge into a durable skill—one that remains valuable even as models, frameworks, and hardware evolve.
If you can explain why something is slow, large, or unstable, you’re already thinking like an engineer who can scale real systems.