Fine-Tuning LLMs in the Real World

A learning-focused resource on memory budgeting, data strategy, catastrophic forgetting, and training stability

Why fine-tuning is harder than it looks

People usually approach fine-tuning like a recipe: pick a base model, feed it instruction data, and expect it to “learn the domain.” That mental model breaks quickly in practice. Fine-tuning is a controlled distribution shift: it can sharpen behavior, but it can also narrow capability, destabilize optimization, and burn GPU memory far faster than expected.

The goal of this post is to help you reason about fine-tuning as an engineering system: compute + data + optimization + evaluation. If you can explain the trade-offs clearly, you’re already ahead of most candidates.

1) How much GPU memory do you need?

Memory usage depends on more than parameters. The true driver is what you store during training: parameters, optimizer states, gradients, and activations. “Full-parameter fine-tuning” is expensive because you update everything.

A useful rule: with Adam/AdamW, optimizer states dominate. In mixed precision training, you typically keep:

model weights (fp16/bf16)
gradients
optimizer moments (m, v) in fp32 or similar
activations (often the largest term when sequence length grows)

This is why a 7B model might run inference on a 16GB GPU but full fine-tune can demand far more, especially with long sequences and large batch.

Practical heuristics you’ll see in the wild:

7B full fine-tuning often becomes realistic around 16–20GB+ only under aggressive memory tactics (checkpointing, sharding, offload, smaller sequences).
For scale training setups, teams often use multi-GPU (FSDP/ZeRO) even for 7B because they want higher throughput and safer headroom.

If you want clean mental math: think in terms of tokens-per-step. The product of batch_size × sequence_length sets your activation footprint, and that often drives OOM long before parameters do.

2) Why an LLM can feel worse after SFT

SFT (supervised fine-tuning) is often misunderstood as “teaching the model new knowledge.” In reality, instruction tuning mostly reshapes behavior: formatting, refusal style, helpfulness, task conditioning, and response structure. If your SFT dataset is small or narrow, you can unintentionally train the model into a smaller behavioral manifold.

The most common failure modes are boring and fixable:

Data distribution collapse: the model learns to respond in a single style because your instruction data is homogeneous.
Overfitting: too many near-duplicate tasks or too many examples per task.
Learning rate too large: you overwrite useful representations, especially in early layers.
Wrong base: instruction-tuning a base model with chat-style data (or the reverse) introduces format mismatch, which looks like “the model got dumber.”

A good mental model is: SFT is a steering wheel, not an engine rebuild. If you steer too hard with low-quality signals, you can drive off the road.

3) How to construct SFT instruction-tuning data

High-quality SFT data is representative, diverse, and balanced. The goal is not raw volume; it is coverage of behaviors you care about without letting one pattern dominate.

What good construction looks like in practice:

Pick a small set of representative task families (Q&A, extraction, rewrite, reasoning, multi-turn helpdesk, tool-style outputs, etc.) and ensure the dataset isn’t accidentally 80% of one family.
Keep each task family from becoming too large relative to the others. Over-weighting one task is a reliable way to create “my model only does this one thing well.”
Increase diversity through prompt variations: different tones, constraints, formats, lengths, and edge cases. This is where robustness comes from.
Use a clean separation between training and evaluation prompts. Leakage is the easiest way to fool yourself.

When candidates talk about SFT data, listen for whether they treat it as behavior shaping (correct) rather than “knowledge injection” (often incorrect).

4) Domain-specific continue pretraining: how to choose data

Continue Pre-Training (CPT) is what you use when you genuinely want the model to absorb domain language patterns and knowledge at scale. The data selection question becomes: what content has high knowledge density and low noise?

In many successful domain CPT efforts, the best sources are not random web pages. They are:

curated technical documents, manuals, textbooks
academic papers and high-quality books
structured domain corpora (standards, regulations, product catalogs, internal docs)
domain news and updates when timeliness matters

The key is not just “domain-related,” but accurate and representative. CPT will happily learn your domain’s misconceptions if you feed them consistently.

5) Catastrophic forgetting: why general capability drops and what to do

When you train only on domain data, you shift the model’s distribution. It becomes more confident in domain continuations and less calibrated elsewhere. This looks like forgetting.

Mitigation is conceptually simple: mix general data during domain training to preserve coverage. A common practical heuristic is to blend domain and general data rather than training on domain alone. If domain data is scarce, domain:general ratios like 1:5 to 1:10 are often used as a starting point.

Beyond mixing, you can reduce forgetting by:

lowering learning rate
limiting which layers you update (PEFT / LoRA)
using replay buffers of general instruction data during later-stage tuning
evaluating continuously on a general benchmark suite to catch regressions early

The most important point: you don’t “solve forgetting” once. You manage it like drift in production systems.

6) How to help the model learn more knowledge during CPT

If your goal is knowledge uptake, CPT generally comes before SFT. CPT expands the model’s internal distribution; SFT teaches it how to respond.

One practical pattern is mixing instruction-like data into CPT (sometimes called multi-task instruction pretraining). The reason is subtle: models don’t just need domain facts—they need to learn how domain questions are asked, how answers are structured, and what constraints matter. Exposing that early makes later SFT easier and more stable.

7) Chat model vs base model for SFT

This is a resource question, not a moral one.

If you have limited data (especially <10k examples), starting from a chat-tuned model usually gives better behavior quickly because alignment is already baked in. If you have substantial domain instruction data (~100k or more) and enough compute, starting from a base model can yield a cleaner fit because you’re not fighting prior chat formatting biases.

The hidden trap: chat models often require strict adherence to their original prompt format. If you ignore that format, your training signal becomes inconsistent, and you get brittle outputs.

8) Input format requirements for instruction tuning

Instruction tuning is not just “prompt + answer.” It is prompt format + role structure + masking strategy.

If you fine-tune a chat model, your training examples should follow the original system/user/assistant conventions that the model expects. Otherwise, you induce a format shift that shows up as:

weird tone
broken refusal behavior
unstable multi-turn coherence
regression on general chat tasks

In multi-turn dialogue tuning, you typically compute loss only on assistant turns and mask user/system tokens. This is the same idea as your pretraining-vs-SFT distinction: SFT changes what contributes to the loss.

9) Domain evaluation: how to build datasets that actually work

Domain evaluation should be two-tiered:

Automatically scored items (multiple-choice, structured extraction, exact-match, unit-test style prompts). These enable fast iteration and regression testing.
Open-ended tasks that require human judgment (helpdesk-style answers, reasoning, long-form writing, safety behavior). These match reality.

A strong evaluation set also needs:

coverage of subtopics (not just the easiest, most frequent)
adversarial and boundary cases (ambiguous, incomplete, conflicting)
a strict training/test separation to avoid memorization

The purpose of evaluation is not to produce a single score. It is to detect trade-offs: domain gain vs general loss, fluency vs factuality, helpfulness vs safety.

10) Is vocabulary expansion necessary?

Vocabulary expansion primarily improves tokenization efficiency in domains with heavy jargon, mixed scripts, or specialized symbols. It can help in languages like Chinese where segmentation matters, but it often yields smaller gains than people expect unless your domain is extremely tokenization-hostile (chemical strings, code-mixed medical terms, etc.).

The bigger levers are still data quality, data mixture, and training stability.

11) How to train your own large model (high-level)

A practical blueprint many teams follow:

Domain-adaptive pretraining / continue pretraining on high-quality corpora
Instruction fine-tuning on carefully constructed instruction-response data
(Optional) preference alignment (RLHF / DPO-style) if you need strong adherence to human preference

LoRA/PEFT can be applied in both stages when compute is limited. Full-parameter training is reserved for teams with serious budgets because it is expensive and easier to destabilize.

12) Batch size: too small vs too large, and how to choose

Batch size is not just a throughput knob. It changes optimization dynamics.

Too small and your gradients are noisy. Training can still work, but updates are unstable and you may need more steps, careful LR tuning, and accumulation.

Too large and you eventually hit diminishing returns: batches become too similar, you get fewer updates per epoch, and performance can plateau or even degrade if LR schedules aren’t adjusted.

The modern practical approach is:

choose the largest batch you can afford after you have tuned learning rate and validated stability
use gradient accumulation to simulate larger batches when memory-limited
treat tokens-per-step as the real unit, not “batch size” alone

13) Optimizers and why Adam can spike

Adam/AdamW is widely used, but large-scale training can experience loss spikes or collapse-like behavior, especially when shallow layers stop updating for long periods and then suddenly receive large gradients. This can create a chain reaction in optimizer state and destabilize training.

Mitigations used in practice include:

lowering learning rate
careful warmup and decay schedules
gradient clipping
tuning epsilon (ϵ) and mixed precision scaling
freezing or shrinking gradients in sensitive layers (e.g., embeddings) in some regimes
resuming from a “healthy” checkpoint and replacing suspect data batches if a spike occurs

The key learning: stability issues are often optimizer-state dynamics, not “bad data” alone.

14) What does the model learn in SFT vs pretraining?

Pretraining teaches broad language modeling: it learns to predict the next token for every token.

SFT changes the training target by masking parts of the input. You feed the prompt and answer together, but compute loss only on the answer tokens. This trains the model to produce a response conditioned on an instruction, which is why it becomes more “helpful” without necessarily gaining new factual knowledge.

This framing helps you decide where to inject domain knowledge:

CPT for broad domain knowledge and language patterns
SFT for domain behaviors, task formats, and instruction-following

Practical takeaway

If you want a domain model that doesn’t regress, the winning recipe is rarely “more SFT.” It’s almost always:

better domain data → controlled CPT → careful SFT → continuous evaluation → conservative optimization.

The candidates who understand that pipeline—and can explain why—are the ones who can actually ship domain LLMs.

Fine-Tuning LLMs in the Real World

A learning-focused resource on memory budgeting, data strategy, catastrophic forgetting, and training stability

Why fine-tuning is harder than it looks

1) How much GPU memory do you need?

A useful rule: with Adam/AdamW, optimizer states dominate. In mixed precision training, you typically keep:

model weights (fp16/bf16)
gradients
optimizer moments (m, v) in fp32 or similar
activations (often the largest term when sequence length grows)

This is why a 7B model might run inference on a 16GB GPU but full fine-tune can demand far more, especially with long sequences and large batch.

Practical heuristics you’ll see in the wild:

7B full fine-tuning often becomes realistic around 16–20GB+ only under aggressive memory tactics (checkpointing, sharding, offload, smaller sequences).
For scale training setups, teams often use multi-GPU (FSDP/ZeRO) even for 7B because they want higher throughput and safer headroom.

2) Why an LLM can feel worse after SFT

The most common failure modes are boring and fixable:

Data distribution collapse: the model learns to respond in a single style because your instruction data is homogeneous.
Overfitting: too many near-duplicate tasks or too many examples per task.
Learning rate too large: you overwrite useful representations, especially in early layers.
Wrong base: instruction-tuning a base model with chat-style data (or the reverse) introduces format mismatch, which looks like “the model got dumber.”

A good mental model is: SFT is a steering wheel, not an engine rebuild. If you steer too hard with low-quality signals, you can drive off the road.

3) How to construct SFT instruction-tuning data

High-quality SFT data is representative, diverse, and balanced. The goal is not raw volume; it is coverage of behaviors you care about without letting one pattern dominate.

What good construction looks like in practice:

Pick a small set of representative task families (Q&A, extraction, rewrite, reasoning, multi-turn helpdesk, tool-style outputs, etc.) and ensure the dataset isn’t accidentally 80% of one family.
Keep each task family from becoming too large relative to the others. Over-weighting one task is a reliable way to create “my model only does this one thing well.”
Increase diversity through prompt variations: different tones, constraints, formats, lengths, and edge cases. This is where robustness comes from.
Use a clean separation between training and evaluation prompts. Leakage is the easiest way to fool yourself.

When candidates talk about SFT data, listen for whether they treat it as behavior shaping (correct) rather than “knowledge injection” (often incorrect).

4) Domain-specific continue pretraining: how to choose data

In many successful domain CPT efforts, the best sources are not random web pages. They are:

curated technical documents, manuals, textbooks
academic papers and high-quality books
structured domain corpora (standards, regulations, product catalogs, internal docs)
domain news and updates when timeliness matters

The key is not just “domain-related,” but accurate and representative. CPT will happily learn your domain’s misconceptions if you feed them consistently.

5) Catastrophic forgetting: why general capability drops and what to do

When you train only on domain data, you shift the model’s distribution. It becomes more confident in domain continuations and less calibrated elsewhere. This looks like forgetting.

Beyond mixing, you can reduce forgetting by:

lowering learning rate
limiting which layers you update (PEFT / LoRA)
using replay buffers of general instruction data during later-stage tuning
evaluating continuously on a general benchmark suite to catch regressions early

The most important point: you don’t “solve forgetting” once. You manage it like drift in production systems.

6) How to help the model learn more knowledge during CPT

If your goal is knowledge uptake, CPT generally comes before SFT. CPT expands the model’s internal distribution; SFT teaches it how to respond.

7) Chat model vs base model for SFT

This is a resource question, not a moral one.

The hidden trap: chat models often require strict adherence to their original prompt format. If you ignore that format, your training signal becomes inconsistent, and you get brittle outputs.

8) Input format requirements for instruction tuning

Instruction tuning is not just “prompt + answer.” It is prompt format + role structure + masking strategy.

If you fine-tune a chat model, your training examples should follow the original system/user/assistant conventions that the model expects. Otherwise, you induce a format shift that shows up as:

weird tone
broken refusal behavior
unstable multi-turn coherence
regression on general chat tasks

9) Domain evaluation: how to build datasets that actually work

Domain evaluation should be two-tiered:

Automatically scored items (multiple-choice, structured extraction, exact-match, unit-test style prompts). These enable fast iteration and regression testing.
Open-ended tasks that require human judgment (helpdesk-style answers, reasoning, long-form writing, safety behavior). These match reality.

A strong evaluation set also needs:

coverage of subtopics (not just the easiest, most frequent)
adversarial and boundary cases (ambiguous, incomplete, conflicting)
a strict training/test separation to avoid memorization

The purpose of evaluation is not to produce a single score. It is to detect trade-offs: domain gain vs general loss, fluency vs factuality, helpfulness vs safety.

10) Is vocabulary expansion necessary?

The bigger levers are still data quality, data mixture, and training stability.

11) How to train your own large model (high-level)

A practical blueprint many teams follow:

Domain-adaptive pretraining / continue pretraining on high-quality corpora
Instruction fine-tuning on carefully constructed instruction-response data
(Optional) preference alignment (RLHF / DPO-style) if you need strong adherence to human preference

LoRA/PEFT can be applied in both stages when compute is limited. Full-parameter training is reserved for teams with serious budgets because it is expensive and easier to destabilize.

12) Batch size: too small vs too large, and how to choose

Batch size is not just a throughput knob. It changes optimization dynamics.

Too small and your gradients are noisy. Training can still work, but updates are unstable and you may need more steps, careful LR tuning, and accumulation.

Too large and you eventually hit diminishing returns: batches become too similar, you get fewer updates per epoch, and performance can plateau or even degrade if LR schedules aren’t adjusted.

The modern practical approach is:

choose the largest batch you can afford after you have tuned learning rate and validated stability
use gradient accumulation to simulate larger batches when memory-limited
treat tokens-per-step as the real unit, not “batch size” alone

13) Optimizers and why Adam can spike

Mitigations used in practice include:

lowering learning rate
careful warmup and decay schedules
gradient clipping
tuning epsilon (ϵ) and mixed precision scaling
freezing or shrinking gradients in sensitive layers (e.g., embeddings) in some regimes
resuming from a “healthy” checkpoint and replacing suspect data batches if a spike occurs

The key learning: stability issues are often optimizer-state dynamics, not “bad data” alone.

14) What does the model learn in SFT vs pretraining?

Pretraining teaches broad language modeling: it learns to predict the next token for every token.

This framing helps you decide where to inject domain knowledge:

CPT for broad domain knowledge and language patterns
SFT for domain behaviors, task formats, and instruction-following

Practical takeaway

If you want a domain model that doesn’t regress, the winning recipe is rarely “more SFT.” It’s almost always:

better domain data → controlled CPT → careful SFT → continuous evaluation → conservative optimization.

The candidates who understand that pipeline—and can explain why—are the ones who can actually ship domain LLMs.

Large Language Models (LLMs) – Fine-Tuning (9)

Quick Overview

Fine-Tuning LLMs in the Real World

Why fine-tuning is harder than it looks

1) How much GPU memory do you need?

2) Why an LLM can feel worse after SFT

3) How to construct SFT instruction-tuning data

4) Domain-specific continue pretraining: how to choose data

5) Catastrophic forgetting: why general capability drops and what to do

6) How to help the model learn more knowledge during CPT

7) Chat model vs base model for SFT

8) Input format requirements for instruction tuning

9) Domain evaluation: how to build datasets that actually work

10) Is vocabulary expansion necessary?

11) How to train your own large model (high-level)

12) Batch size: too small vs too large, and how to choose

13) Optimizers and why Adam can spike

14) What does the model learn in SFT vs pretraining?

Practical takeaway

Comments (0)

Large Language Models (LLMs) – Fine-Tuning (9)

Quick Overview

Fine-Tuning LLMs in the Real World

Why fine-tuning is harder than it looks

1) How much GPU memory do you need?

2) Why an LLM can feel worse after SFT

3) How to construct SFT instruction-tuning data

4) Domain-specific continue pretraining: how to choose data

5) Catastrophic forgetting: why general capability drops and what to do

6) How to help the model learn more knowledge during CPT

7) Chat model vs base model for SFT

8) Input format requirements for instruction tuning

9) Domain evaluation: how to build datasets that actually work

10) Is vocabulary expansion necessary?

11) How to train your own large model (high-level)

12) Batch size: too small vs too large, and how to choose

13) Optimizers and why Adam can spike

14) What does the model learn in SFT vs pretraining?

Practical takeaway

Comments (0)