You are working on a project to fine-tune a large language model (LLM) using Direct Preference Optimization (DPO).
Answer the following:
- **Conceptual**: What is Direct Preference Optimization (DPO) at a high level, and how does it differ conceptually from a standard RLHF pipeline that uses PPO (Proximal Policy Optimization)? Focus on:
  - What objective DPO optimizes (the standard DPO loss is reproduced after the questions for reference).
  - Why it can avoid training a separate reward model.
  - Practical benefits and trade-offs compared with PPO-based RLHF.
-
- **Data construction**: How would you construct a training dataset suitable for DPO when fine-tuning an LLM? Describe:
  - The format of one training example (what fields it contains); see the illustrative sketch at the end of this section.
  - How to collect or generate the **preferred** vs. **dispreferred** responses.
  - How to handle noisy labels or ties.
  - Any preprocessing or filtering you would do to improve data quality.
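
For reference when answering the conceptual question, the DPO objective introduced by Rafailov et al. (2023) is usually written as a preference-classification loss against a frozen reference policy:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\ \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen SFT model, $(x, y_w, y_l)$ is a prompt paired with its preferred and dispreferred responses, $\beta$ controls the strength of the implicit KL constraint, and $\sigma$ is the logistic sigmoid.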
Assume you start from a base SFT (supervised fine-tuned) model and that you have the ability to collect either human preference data or model-generated preference data.
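
To make the data-construction answer concrete, here is a minimal sketch of one preference record, assuming a JSONL dataset and the `prompt` / `chosen` / `rejected` field names popularized by Hugging Face's `trl` library; the exact schema and the optional metadata shown are illustrative assumptions, not requirements.

```python
import json

# One DPO preference record. The field names ("prompt", "chosen", "rejected")
# follow the convention used by trl's DPOTrainer, but they are only an
# illustrative assumption -- adapt them to whatever trainer you use.
example = {
    # The input the SFT model was prompted with.
    "prompt": "Explain the difference between a list and a tuple in Python.",
    # The preferred response, e.g. the one a human annotator ranked higher.
    "chosen": "A list is mutable, so you can add, remove, or change elements; "
              "a tuple is immutable once created.",
    # The dispreferred response sampled from the same prompt.
    "rejected": "They are basically the same thing.",
    # Optional metadata, useful when filtering noisy labels or ties.
    "meta": {"source": "human", "annotator_agreement": 0.8},
}

# Preference datasets are commonly stored as JSON Lines, one record per line.
with open("dpo_train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```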