Large Language Models (LLMs) — Reinforcement Learning
Quick Overview
This guide covers reinforcement learning concepts applied to large language models, including RLHF, supervised fine-tuning, reward model training, policy optimization (PPO), practical limitations, and emerging alternatives in alignment.

Reinforcement Learning, RLHF, and the Evolution of Alignment Methods
A Learning-Oriented Resource for Understanding How Large Models Are Aligned
This post is written to help you build intuition, not just memorize terminology. The focus is on how reinforcement learning ideas are adapted to large language models, why RLHF emerged, where it breaks down in practice, and how newer methods simplify or replace it. Think of this as a map of how alignment thinking has evolved.
1. What Is Reinforcement Learning?
Reinforcement Learning (RL) is a learning paradigm centered on interaction and feedback. An agent takes actions in an environment, receives reward signals, and gradually learns a policy that maximizes long-term cumulative reward.
What distinguishes RL from supervised learning is that:
- feedback is often delayed,
- the correct action is not explicitly labeled,
- learning is driven by trial, error, and evaluation.
At its core, RL is about adaptive decision-making under uncertainty. This framing becomes crucial when we apply RL concepts to language models, which do not act in physical environments but instead interact through text.
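To make this loop concrete, here is a minimal, self-contained sketch of tabular Q-learning on a toy chain environment. The environment, the single reward at the right end, and the hyperparameters are illustrative assumptions; nothing here is specific to language models yet.

```python
import random

# Toy chain environment: states 0..4, actions 0 (left) / 1 (right).
# Only reaching state 4 yields reward 1; every other step gives 0,
# so useful feedback arrives only at the end of an episode.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.3

Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]

for episode in range(500):
    state = 0
    while state != GOAL:
        # Epsilon-greedy action selection; break ties randomly.
        if random.random() < EPSILON or Q[state][0] == Q[state][1]:
            action = random.randint(0, 1)
        else:
            action = 0 if Q[state][0] > Q[state][1] else 1

        next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
        reward = 1.0 if next_state == GOAL else 0.0

        # Q-learning update: move Q toward reward + discounted best future value.
        best_next = max(Q[next_state])
        Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])
        state = next_state

print(Q)  # the "right" action should dominate in every state after training
```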
2. What Is RLHF and Why It Matters
RLHF—Reinforcement Learning from Human Feedback—adapts RL to language models by redefining the “environment” as human preference.
Instead of physical rewards, the model learns from signals such as:
- which response humans prefer,
- which answer is more helpful,
- which output feels safer or more truthful.
The standard RLHF pipeline consists of three stages:
- Supervised Fine-Tuning (SFT) to teach basic instruction-following behavior
- Reward Model (RM) training to approximate human preference judgments
- Policy optimization (often PPO) using the reward model as feedback
This framework played a key role in making GPT-3–era models more aligned with human expectations. Importantly, RLHF does not primarily teach new knowledge—it reshapes how existing knowledge is expressed.
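As a concrete illustration of the reward-modeling stage, the sketch below shows the standard pairwise (Bradley-Terry style) loss used to train a reward model on preference pairs. The tensor shapes and the random scores standing in for reward-model outputs are assumptions for illustration, not a particular library's API.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the scalar reward of the preferred
    response above the reward of the rejected response for each pair."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage: random scores stand in for reward-model outputs
# on a batch of 8 preference pairs.
chosen = torch.randn(8)    # reward of the human-preferred response
rejected = torch.randn(8)  # reward of the rejected response
print(pairwise_reward_loss(chosen, rejected).item())
```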
3. Do the Reward Model and Base Model Need to Be the Same?
In theory, the reward model and the policy model can be different. In practice, many implementations impose constraints.
When both models share:
- the same tokenizer,
- the same vocabulary,
- and similar architectural assumptions,
training becomes simpler and more stable. This is why PPO-based RLHF pipelines often select reward models from the same family as the base model.
This reflects a deeper principle: alignment is easier when representations are compatible. Mismatched tokenization or embedding spaces can introduce subtle failure modes.
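A quick way to see the issue: tokenizers from different model families segment the same text differently, so per-token scores or log-probabilities from one model do not line up with the other's tokens. The snippet below illustrates this with two public Hugging Face tokenizers (gpt2 and bert-base-uncased); the model choices are only examples.

```python
from transformers import AutoTokenizer

text = "Reward models and policies should speak the same token language."

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# The same string maps to different token IDs and sequence lengths,
# so token-level signals from one model do not align with the other's tokens.
print(len(gpt2_tok(text)["input_ids"]), gpt2_tok.tokenize(text)[:5])
print(len(bert_tok(text)["input_ids"]), bert_tok.tokenize(text)[:5])
```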
4. The Practical Limitations of RLHF
Despite its success, RLHF has significant real-world costs.
Human preference data is expensive to collect, slow to iterate on, and difficult to scale. Each comparison requires time, judgment, and consistency.
The training pipeline itself is long. Running SFT, then RM training, then PPO means slow experimentation cycles and high operational complexity.
Compute cost is another bottleneck. PPO-based RLHF often involves:
- a policy model,
- a reference model,
- a reward model,
- and a value (critic) model used during optimization.
This can mean four large models active simultaneously, pushing memory and compute requirements beyond what many teams can afford.
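A rough back-of-envelope estimate makes this concrete. The numbers below (four 7B-parameter models, bf16 weights, fp32 Adam state) are illustrative assumptions, not a measurement of any specific system.

```python
# Rough memory estimate for a PPO-style RLHF setup (illustrative numbers only).
PARAMS = 7e9          # assume four 7B-parameter models
BYTES_BF16 = 2        # bf16 weights
GB = 1024 ** 3

weights_per_model = PARAMS * BYTES_BF16 / GB    # ~13 GB per model
four_models = 4 * weights_per_model             # policy + reference + reward + value

# Each model that is actually trained (the policy, and often the value model)
# also needs Adam state: roughly 4 bytes of fp32 master weights plus
# 8 bytes for the two moment estimates per parameter.
optimizer_per_trained_model = PARAMS * (4 + 8) / GB

print(f"weights for four models:            ~{four_models:.0f} GB")
print(f"optimizer state per trained model:  ~{optimizer_per_trained_model:.0f} GB")
```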
5. Reducing Human Cost: AI Replacing Humans
One major direction of innovation focuses on replacing human feedback with model-generated feedback.
RLAIF: Reinforcement Learning from AI Feedback
RLAIF uses AI models as proxy annotators. Instead of humans judging outputs, another model evaluates and corrects responses.
During early stages, one model generates samples while another critiques them. These critiques are then used to fine-tune the base model. During later RL stages, an AI-trained reward model replaces human judgment entirely.
This approach dramatically improves scalability, though it introduces a new risk: feedback bias amplification, where models reinforce each other’s mistakes.
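The control flow of the early, critique-driven stage can be sketched as follows. The generator, critic, and reviser callables are hypothetical placeholders for whatever models a given RLAIF setup uses, so read this as pseudocode-shaped Python rather than a specific recipe.

```python
def rlaif_bootstrap(prompts, generator, critic, reviser):
    """Hypothetical sketch: build a fine-tuning set from AI feedback.

    generator(prompt) -> draft response
    critic(prompt, response) -> textual critique
    reviser(prompt, response, critique) -> improved response
    """
    finetune_pairs = []
    for prompt in prompts:
        draft = generator(prompt)
        critique = critic(prompt, draft)           # AI feedback replaces the human judge
        improved = reviser(prompt, draft, critique)
        finetune_pairs.append((prompt, improved))  # later used for supervised fine-tuning
    return finetune_pairs
```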
RRHF: Rank Responses to Align with Human Feedback
RRHF takes a different path. It removes reinforcement learning altogether.
Multiple candidate responses are generated, often by different models. Humans rank these responses by preference. A ranking-based loss is then used directly to fine-tune the model.
Interestingly, a model trained this way can function both as:
- a generation model,
- and a preference or reward model.
RRHF highlights an important insight: ranking is often enough. Explicit reward modeling is not always necessary.
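In code, the core of this idea is a ranking loss over the model's (often length-normalized) log-probabilities of the candidate responses, typically combined with an ordinary fine-tuning loss on the best response. The sketch below is a generic pairwise version in that spirit, not the exact published objective.

```python
import torch

def ranking_loss(logprobs: torch.Tensor, ranks: torch.Tensor) -> torch.Tensor:
    """Pairwise hinge loss: a better-ranked candidate (lower rank number)
    should receive a higher model log-probability than a worse-ranked one.

    logprobs: (num_candidates,) length-normalized sequence log-probs under the model
    ranks:    (num_candidates,) preference ranks, 0 = best
    """
    loss = logprobs.new_zeros(())
    n = logprobs.shape[0]
    for i in range(n):
        for j in range(n):
            if ranks[i] < ranks[j]:  # candidate i is preferred over candidate j
                loss = loss + torch.relu(logprobs[j] - logprobs[i])
    return loss

# Illustrative usage: 4 candidates for one prompt, ranked by preference.
lp = torch.tensor([-1.2, -0.8, -2.0, -1.5], requires_grad=True)
loss = ranking_loss(lp, torch.tensor([1, 0, 3, 2]))
loss.backward()
print(loss.item(), lp.grad)
```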
6. Shortening the Pipeline: Data-Centric Alignment
Another major shift is moving away from complex training pipelines toward data quality optimization.
The core assumption is simple but powerful:
If the data is good enough, the model will align itself.
LIMA: Less Is More for Alignment
LIMA argues that most reasoning and knowledge are learned during pretraining. Alignment is primarily about reshaping output distribution, not teaching new skills.
By carefully curating a small, high-quality dataset, LIMA shows that supervised fine-tuning alone can achieve strong alignment—without RLHF.
This reframes alignment as a data selection problem, not an optimization problem.
“Only 0.5% Data Is Needed”
This idea pushes the same logic further. Instead of more data, focus on the most informative samples.
By identifying high-impact examples, models can achieve strong performance with a tiny fraction of the original dataset. This dramatically reduces training cost while preserving or even improving quality.
The broader lesson: not all data is equally valuable.
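One generic way to operationalize this is to score every example with some quality or informativeness measure and keep only the top fraction. The score_fn below is a deliberately unspecified placeholder; real methods differ precisely in how they define it.

```python
def select_top_fraction(examples, score_fn, fraction=0.005):
    """Generic sketch of data-centric selection: keep only the highest-scoring
    fraction of the instruction data. `score_fn` is a placeholder for whatever
    quality or informativeness measure a given method defines."""
    scored = sorted(examples, key=score_fn, reverse=True)
    k = max(1, int(len(scored) * fraction))
    return scored[:k]
```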
7. Reducing PPO Cost: Simplifying Training Objectives
A third direction focuses on eliminating PPO itself.
RAFT: Reward Ranked Fine-Tuning
RAFT combines reward-model scoring with ranking-based fine-tuning. Instead of running a full RL loop, it samples multiple candidate responses per prompt, keeps the highest-reward ones, and fine-tunes on them with a standard supervised objective.
This preserves preference learning while avoiding PPO’s instability and compute overhead.
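In practice this is usually realized as best-of-n sampling: generate several candidates per prompt, score them with the reward model, and fine-tune only on the winners. The sketch below uses placeholder generate, reward, and finetune callables as assumptions.

```python
def raft_style_round(prompts, generate, reward, finetune, n_samples=4):
    """One round of reward-ranked fine-tuning (a sketch, not the exact paper recipe).

    generate(prompt) -> one sampled response from the current model
    reward(prompt, response) -> scalar score from the reward model
    finetune(pairs) -> supervised fine-tuning on (prompt, response) pairs
    """
    best_pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        best = max(candidates, key=lambda r: reward(prompt, r))
        best_pairs.append((prompt, best))
    finetune(best_pairs)  # ordinary SFT on the reward-ranked winners; no PPO loop
```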
DPO: Direct Preference Optimization
DPO represents a more radical simplification.
It introduces a direct objective that:
- uses pairwise preference comparisons,
- removes the need for a separate reward model,
- eliminates PPO entirely.
DPO reframes alignment as a pure optimization problem, not a reinforcement learning one. This greatly simplifies engineering while maintaining strong empirical performance.
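Concretely, the DPO loss compares the policy's and a frozen reference model's log-probabilities on chosen versus rejected responses. Below is a minimal sketch of that objective; the batch shape and the beta value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss (sketch).

    Each argument is a (batch,) tensor of summed sequence log-probabilities.
    No reward model and no PPO: the preference signal enters the loss directly.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between the implicit rewards of chosen and rejected.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Illustrative usage with random log-probabilities standing in for model outputs.
policy_chosen = torch.randn(8, requires_grad=True)
loss = dpo_loss(policy_chosen, torch.randn(8), torch.randn(8), torch.randn(8))
loss.backward()
print(loss.item())
```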
Final Perspective
The evolution from RLHF to RLAIF, LIMA, RAFT, and DPO reflects a broader trend:
Alignment is moving from heavy reinforcement learning toward simpler, more data- and objective-driven methods.
The key insight is that large language models already possess vast capabilities. Alignment is less about teaching them how to think and more about shaping how they respond.
Understanding this shift will help you reason about future alignment techniques—many of which may not look like reinforcement learning at all.