
DPO-Shift: Shifting the Distribution of Direct Preference Optimization

Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li

2025-02-13


Summary

This paper introduces DPO-Shift, a new method to improve how AI language models learn from human preferences. It's like teaching a smart computer to make better choices based on what people like, while fixing a side effect of the current teaching method.

What's the problem?

The current way of teaching AI to understand human preferences, called Direct Preference Optimization (DPO), has a weird side effect. As the AI learns to tell good responses from bad ones, it sometimes becomes less likely to give the good responses. It's like a student who learns to recognize the right answers but becomes hesitant to use them.

What's the solution?

The researchers created DPO-Shift, which adds a tunable knob that controls how strongly the AI favors the preferred responses. They show there is a fundamental trade-off: making the AI more likely to give the good answers means giving up some of its ability to separate good responses from bad ones, and the knob lets you pick where on that trade-off to sit. They tested their method and showed it works better than the original DPO on various tasks.
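To make the idea concrete, here is a minimal sketch of a DPO-style loss with such a knob. This assumes the shift enters as a multiplicative factor `f_lambda` on the rejected log-ratio (with `f_lambda = 1.0` recovering standard DPO); the names and toy numbers below are illustrative, not taken from the paper's code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_shift_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   beta=0.1, f_lambda=0.75):
    """Per-example preference loss with a shift factor (sketch).

    f_lambda = 1.0 reduces to standard DPO; values below 1 down-weight
    the rejected log-ratio, trading some reward margin for a higher
    probability of the chosen response.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen      # log pi(y_w)/pi_ref(y_w)
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi(y_l)/pi_ref(y_l)
    margin = beta * chosen_ratio - f_lambda * beta * rejected_ratio
    return -math.log(sigmoid(margin))

# Toy log-probabilities under the policy and a frozen reference model.
loss_dpo = dpo_shift_loss(-2.0, -3.0, -2.5, -2.5, f_lambda=1.0)
loss_shift = dpo_shift_loss(-2.0, -3.0, -2.5, -2.5, f_lambda=0.75)
```

In practice the log-probabilities would come from the policy and reference language models summed over response tokens; the point of the sketch is only where the shift factor sits in the loss.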

Why does it matter?

This matters because it helps make AI assistants and language models more reliable and better at giving the kind of responses people actually want. By solving this problem, AI could become more helpful and trustworthy in various applications, from chatbots to writing assistants. It's a step towards creating AI that not only understands what we prefer but is also more likely to act on those preferences consistently.

Abstract

Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, and this phenomenon is known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. Then, we show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.