General Preference Modeling with Preference Representations for Aligning Language Models
Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, Quanquan Gu
2024-10-03

Summary
This paper introduces a new method called General Preference Modeling, which aims to better align large language models (LLMs) with human preferences by improving how preference models capture and represent what people actually prefer.
What's the problem?
Current methods for modeling human preferences, like the Bradley-Terry (BT) reward model, assign each response a single scalar score, so they cannot represent preferences that are not straightforward, such as cyclic preferences where someone prefers A over B, B over C, yet C over A. Pairwise preference models can express these more general preferences, but their implementations are ad-hoc, cannot guarantee consistent preference probabilities for the compared pairs, and become expensive because comparing K responses requires on the order of K² model queries. Together, these limitations make it hard for LLMs to consistently provide answers that align with what users actually want.
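To make the intransitivity problem concrete, here is a minimal sketch (not code from the paper; the reward values are random and purely illustrative) showing that no choice of scalar Bradley-Terry rewards can give all three legs of a cycle A > B > C > A a probability above 0.5:

```python
# Minimal sketch: why a scalar Bradley-Terry (BT) reward cannot model a cyclic
# preference A > B > C > A. Under BT, P(A beats B) = sigmoid(r_A - r_B), and the
# three differences (r_A - r_B) + (r_B - r_C) + (r_C - r_A) always sum to zero,
# so they cannot all be positive at the same time.
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
best = 0.0
for _ in range(100_000):
    r_a, r_b, r_c = (random.uniform(-10, 10) for _ in range(3))
    p_ab = sigmoid(r_a - r_b)  # P(A beats B)
    p_bc = sigmoid(r_b - r_c)  # P(B beats C)
    p_ca = sigmoid(r_c - r_a)  # P(C beats A)
    best = max(best, min(p_ab, p_bc, p_ca))

# At least one of the three probabilities is always <= 0.5, so a BT reward
# model treats some leg of the cycle no better than a coin flip.
print(f"best min-probability over random rewards: {best:.3f}")
```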
What's the solution?
The authors introduce preference representation learning, which embeds responses into a latent space so that rich (even cyclic) preference structures can be captured while keeping query complexity linear in the number of responses. They also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback by optimizing the policy against preference scores rather than scalar rewards. Their experiments show that the resulting General Preference representation Model (GPM) outperforms the BT reward model on RewardBench by up to 5.6%, and that post-training language models with GPO and the general preference model improves downstream results on AlpacaEval 2.0 and MT-Bench by margins of up to 9.3%.
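To illustrate how an embedding-based preference score can represent such a cycle, here is a minimal sketch. It assumes the preference score is a skew-symmetric bilinear form over response embeddings; the two-dimensional embeddings and operator below are hand-picked for illustration, not taken from the paper's implementation:

```python
# Illustrative sketch of preference representation scoring (assumed form, not the
# paper's code): embed each response into a latent vector v and score a pair as
# score(i, j) = v_i^T R v_j with a skew-symmetric operator R (R^T = -R).
# Skew-symmetry gives score(i, j) = -score(j, i), so P(i beats j) = sigmoid(score)
# is consistent for every pair, and cyclic preferences become representable.
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

# A skew-symmetric operator on a 2-D latent space (a 90-degree rotation).
R = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

# Hand-picked unit embeddings for three responses, equally spaced on a circle.
angles = {"A": 0.0, "B": 2 * np.pi / 3, "C": 4 * np.pi / 3}
emb = {name: np.array([np.cos(a), np.sin(a)]) for name, a in angles.items()}

def pref_prob(i: str, j: str) -> float:
    """P(response i is preferred over response j)."""
    score = emb[i] @ R @ emb[j]  # equals -score(j, i) by skew-symmetry
    return sigmoid(score)

# All three cyclic preferences exceed 0.5, which no scalar reward can achieve.
for i, j in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"P({i} > {j}) = {pref_prob(i, j):.3f}")  # ~0.70 each
```

Because each response only needs to be embedded once, scoring all pairs among K candidates takes K model calls rather than one call per pair.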
Why it matters?
This research is important because it helps make AI systems more aligned with what people actually want, leading to better interactions and more useful outputs. As AI becomes more integrated into everyday life, ensuring that these systems understand and respect human values is crucial for creating trustworthy and effective technology.
Abstract
Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. Although supervised pair preference models (PairPM) can express general preferences, their implementation is highly ad-hoc and cannot guarantee a consistent preference probability of compared pairs. Additionally, they impose high computational costs due to their quadratic query complexity when comparing multiple responses. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark with a margin of up to 5.6% and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, following the language model post-training with GPO and our general preference model, reveal substantial performance improvements with margins up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.
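As a rough illustration of the query-complexity claim in the abstract, the following sketch (illustrative only; the function names are made up) counts model forward passes when ranking K candidate responses: an embedding-based preference model needs one pass per response, while a pairwise preference model needs one pass per compared pair:

```python
# Illustrative count of model forward passes when comparing K responses.
def embedding_model_calls(k: int) -> int:
    # one forward pass per response to obtain its latent embedding;
    # all pairwise preference scores are then computed from cached embeddings
    return k

def pairwise_model_calls(k: int) -> int:
    # one forward pass per compared pair of responses (quadratic growth)
    return k * (k - 1) // 2

for k in (4, 16, 64):
    print(f"K={k:3d}  embedding-based: {embedding_model_calls(k):4d} calls"
          f"   pairwise: {pairwise_model_calls(k):5d} calls")
```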