Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization

Yao Xiao, Hai Ye, Linyao Chen, Hwee Tou Ng, Lidong Bing, Xiaoli Li, Roy Ka-wei Lee

2025-02-26

Summary

This paper introduces a new way to improve how AI language models learn from human preferences, especially when dealing with large amounts of data.

What's the problem?

Current methods for teaching AI to follow human preferences work well with small amounts of data, but they start to perform worse as the amount of data grows. This is because they don't choose the right examples to learn from when they have too many options.

What's the solution?

The researchers found a better way to pick examples for the AI to learn from. Instead of always choosing the best and worst examples, they discovered that picking a 'pretty bad' example (but not the absolute worst) as the rejected option works better. They used statistics to pinpoint exactly where this sweet spot is, and created a method that keeps working well even with lots of data.
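The selection idea described above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's actual implementation: `reward_fn` and `build_preference_pair` are hypothetical names, and the rule of picking the sample whose reward is closest to mu - 2*sigma is taken from the paper's headline finding.

```python
import statistics

def build_preference_pair(responses, reward_fn):
    """Pick a (chosen, rejected) pair from many sampled responses.

    Sketch of the paper's idea: the highest-reward sample is 'chosen',
    but 'rejected' is the sample whose reward is closest to
    mu - 2*sigma of the empirical reward distribution, rather than
    the sample with the absolute minimum reward.
    """
    rewards = [reward_fn(r) for r in responses]
    mu = statistics.mean(rewards)
    sigma = statistics.stdev(rewards)
    target = mu - 2 * sigma  # the "sweet spot" for the rejected sample

    # Chosen: the response with the highest reward.
    chosen = responses[rewards.index(max(rewards))]
    # Rejected: the response whose reward is nearest to mu - 2*sigma.
    rejected = min(zip(responses, rewards),
                   key=lambda pair: abs(pair[1] - target))[0]
    return chosen, rejected
```

With enough samples, the rejected response lands near the low tail of the reward distribution without being the single worst outlier, which is exactly the distinction the paper finds matters at scale.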

Why it matters?

This matters because it helps AI language models get better at understanding and following human preferences, even when working with large amounts of data. This could lead to AI assistants that are more helpful and better at understanding what humans want, which is important as AI becomes more common in our daily lives.

Abstract

Iterative data generation and model retraining are widely used to align large language models (LLMs). This process typically involves a policy model that generates on-policy responses and a reward model that guides training data selection. Direct Preference Optimization (DPO) further enhances this process by constructing preference pairs of chosen and rejected responses. In this work, we aim to scale up the number of on-policy samples via repeated random sampling to improve alignment performance. Conventional practice selects the sample with the highest reward as chosen and the lowest as rejected for DPO. However, our experiments reveal that this strategy leads to a decline in performance as the sample size increases. To address this, we investigate preference data construction through the lens of the underlying normal distribution of sample rewards. We categorize the reward space into seven representative points and systematically explore all 21 (C_7^2) pairwise combinations. Through evaluations on four models using AlpacaEval 2, we find that selecting the rejected response at reward position mu - 2*sigma, rather than at the minimum reward, is crucial for optimal performance. We finally introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
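The abstract's "21 (C_7^2) pairwise combinations" can be made concrete with a short enumeration. The labels below are assumptions for illustration only: the summary does not specify the seven representative points, so they are written here generically as positions in the reward distribution.

```python
from itertools import combinations

# Hypothetical labels for the seven representative reward positions;
# the exact definitions used in the paper are not given in this summary.
positions = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]

# All C(7,2) = 21 unordered (chosen, rejected) candidate pairings
# that the paper systematically evaluates.
pairs = list(combinations(positions, 2))
```

Each pairing defines one candidate strategy for building DPO preference pairs, and the paper reports that the pairing whose rejected side sits at mu - 2*sigma performs best.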