You are working on a project to fine-tune a large language model (LLM) using Direct Preference Optimization (DPO).
Answer the following:
- **Conceptual**: What is Direct Preference Optimization (DPO) at a high level, and how does it differ conceptually from a standard RLHF pipeline that uses PPO (Proximal Policy Optimization)? Focus on:
  - What objective DPO optimizes (the standard DPO loss is reproduced after the questions for reference).
  - Why it can avoid training a separate reward model.
  - Practical benefits and trade-offs compared with PPO-based RLHF.
-
- **Data construction**: How would you construct a training dataset suitable for DPO when fine-tuning an LLM? Describe:
  - The format of one training example (what fields it contains); see the illustrative sketch at the end of this section.
  - How to collect or generate the **preferred** vs. **dispreferred** responses.
  - How to handle noisy labels or ties.
  - Any preprocessing or filtering you would do to improve data quality.
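
For reference when answering the conceptual question, the DPO objective introduced by Rafailov et al. (2023) is usually written as a preference-classification loss against a frozen reference policy:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\ \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen SFT model, $(x, y_w, y_l)$ is a prompt paired with its preferred and dispreferred responses, $\beta$ controls the strength of the implicit KL constraint, and $\sigma$ is the logistic sigmoid.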
Assume you start from a base SFT (supervised fine-tuned) model and that you have the ability to collect either human preference data or model-generated preference data.
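
To make the data-construction answer concrete, here is a minimal sketch of one preference record, assuming a JSONL dataset and the `prompt` / `chosen` / `rejected` field names popularized by Hugging Face's `trl` library; the exact schema and the optional metadata shown are illustrative assumptions, not requirements.

```python
import json

# One DPO preference record. The field names ("prompt", "chosen", "rejected")
# follow the convention used by trl's DPOTrainer, but they are only an
# illustrative assumption -- adapt them to whatever trainer you use.
example = {
    # The input the SFT model was prompted with.
    "prompt": "Explain the difference between a list and a tuple in Python.",
    # The preferred response, e.g. the one a human annotator ranked higher.
    "chosen": "A list is mutable, so you can add, remove, or change elements; "
              "a tuple is immutable once created.",
    # The dispreferred response sampled from the same prompt.
    "rejected": "They are basically the same thing.",
    # Optional metadata, useful when filtering noisy labels or ties.
    "meta": {"source": "human", "annotator_agreement": 0.8},
}

# Preference datasets are commonly stored as JSON Lines, one record per line.
with open("dpo_train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```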