Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang

2025-04-24

Summary

This paper introduces Pre-DPO, a new method for training large language models to follow human preferences more closely by making smarter use of the preference data already available during training.

What's the problem?

The problem is that in standard preference-training methods like Direct Preference Optimization (DPO), the reference model that weights the training data is simply a frozen copy of the model being trained. Because the two start out identical, the reference can't tell which examples deserve more attention, so much of the training data's value goes unused and the model's performance hits a ceiling.

What's the solution?

The researchers introduced Pre-DPO, which trains in two stages. First, they run an ordinary round of preference optimization (DPO, or its reference-free variant SimPO). Then they restart training from the original model, but use the first-stage result as a guiding reference model. Because this guide has already seen where the preference data leads, it adaptively gives higher weight to examples the model can still learn from and lower weight to ones that suit it poorly, so each pass over the same data teaches the model more.
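
To make this concrete, here is a minimal sketch of the two-stage recipe in PyTorch, assuming the standard DPO loss on sequence log-probabilities. The train_dpo helper and all names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss. The reference model's log-probs offset the
    policy's, so the reference effectively re-weights each preference pair."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Pre-DPO as a two-stage recipe (train_dpo is a hypothetical helper that
# runs preference optimization with the loss above):
#
# Stage 1: ordinary DPO -- the reference is a frozen copy of the start model.
#   guide = train_dpo(policy=initial_model, reference=initial_model, data=prefs)
#
# Stage 2: restart from the initial model, but use the stage-1 policy as the
# guiding reference. Having already fit the data, it shifts weight toward
# pairs the fresh policy still gets wrong and away from ill-suited ones.
#   final = train_dpo(policy=initial_model, reference=guide, data=prefs)
```

Note that both stages reuse the same preference data; the gain comes purely from swapping in a better-informed reference model, not from collecting anything new.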

Why it matters?

This matters because Pre-DPO gets more out of the preference data teams already have, without requiring extra data or external models, leading to AI systems that follow human intent more reliably and are therefore more helpful and trustworthy in real-world applications.

Abstract

Pre-DPO enhances preference optimization in RLHF for LLMs by using a guiding reference model to improve data utilization and performance.