This question evaluates understanding of Direct Preference Optimization (DPO) for fine-tuning large language models, assessing its conceptual differences from PPO-based RLHF and the ability to design pairwise preference training datasets.
You are working on a project to fine-tune a large language model (LLM) using Direct Preference Optimization (DPO).
Answer the following:
Assume you start from a base supervised fine-tuned (SFT) model and that you can collect either human preference data or model-generated preference data.
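As background for the dataset-design part, here is a minimal, illustrative sketch (in Python) of what a single pairwise preference record and the per-pair DPO loss can look like. The field names (`prompt`, `chosen`, `rejected`), the example text, and the value of `beta` are assumptions for illustration rather than a fixed schema, and the log-probabilities are stand-in numbers rather than outputs of a real model.

```python
# Sketch: one pairwise preference record and the per-pair DPO loss,
# assuming sequence log-probabilities are already available as numbers.
import math

# A pairwise preference record: a prompt plus a preferred ("chosen")
# and a dispreferred ("rejected") completion. Field names are illustrative.
preference_record = {
    "prompt": "Explain the difference between DPO and PPO-based RLHF.",
    "chosen": "DPO optimizes the policy directly on preference pairs ...",
    "rejected": "DPO and PPO are the same algorithm ...",
}

def dpo_loss(
    logp_chosen_policy: float,    # log pi_theta(y_w | x)
    logp_rejected_policy: float,  # log pi_theta(y_l | x)
    logp_chosen_ref: float,       # log pi_ref(y_w | x), frozen SFT reference
    logp_rejected_ref: float,     # log pi_ref(y_l | x)
    beta: float = 0.1,            # strength of the implicit KL constraint (illustrative value)
) -> float:
    """DPO loss for a single preference pair: -log sigmoid(beta * margin)."""
    margin = (logp_chosen_policy - logp_chosen_ref) - (
        logp_rejected_policy - logp_rejected_ref
    )
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Example with made-up log-probabilities: the loss shrinks as the policy
# assigns relatively more probability mass to the chosen completion
# (relative to the reference) than to the rejected one.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

In practice, the sequence log-probabilities would come from the policy being trained and from a frozen copy of the SFT model used as the reference; no separate reward model or PPO rollout loop is involved.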