RePO: ReLU-based Preference Optimization
Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang
2025-03-11

Summary
This paper introduces RePO, a simpler method for teaching AI language models to follow human preferences. It uses a common neural-network function (ReLU) to focus training on the pairs where the difference between good and bad answers still matters.
What's the problem?
Existing methods for training AI to match human preferences are either complicated (requiring several settings, called hyperparameters, to tune) or unstable, which makes them slow and hard to use in practice.
What's the solution?
RePO simplifies the process by using ReLU (a common neural-network function) to keep only the preference pairs where good and bad answers are not yet clearly separated, filtering out the trivial ones. This cuts out extra steps and leaves just one setting to adjust.
Why it matters?
This makes AI models safer and more helpful in real-world applications (like chatbots), because they can be trained faster and more reliably to pick the answers humans prefer.
Abstract
Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter β, subsequent methods like SimPO reintroduce complexity through dual parameters (β, γ). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates β via two advances: (1) retaining SimPO's reference-free margins but removing β through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case (β → ∞), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models, requiring only one hyperparameter to tune.
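To make the one-hyperparameter idea concrete, below is a minimal PyTorch sketch assuming the reference-free margin is the difference of length-normalized (average per-token) log-probabilities, as in SimPO, and that the ReLU-based max-margin loss takes the hinge form max(0, γ − margin). The function names, toy numbers, and exact loss form are illustrative inferences from the abstract, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def reference_free_margin(avg_logp_chosen, avg_logp_rejected):
    # SimPO-style reference-free margin: gap between the length-normalized
    # (average per-token) log-probabilities of the chosen vs. rejected response.
    return avg_logp_chosen - avg_logp_rejected

def repo_loss(margin, gamma=1.0):
    # ReLU-based max-margin loss (assumed form): pairs whose margin already
    # exceeds gamma contribute zero loss and zero gradient, so trivial pairs
    # are filtered out. gamma is the single hyperparameter to tune.
    return F.relu(gamma - margin).mean()

def simpo_loss(margin, beta=2.0, gamma=1.0):
    # SimPO's logistic loss, shown for contrast: it needs both beta and gamma.
    return -F.logsigmoid(beta * margin - gamma).mean()

# Toy usage with dummy average log-probabilities for a batch of three pairs.
avg_logp_chosen = torch.tensor([-1.2, -0.8, -2.0])
avg_logp_rejected = torch.tensor([-1.5, -2.3, -1.9])
margin = reference_free_margin(avg_logp_chosen, avg_logp_rejected)
print(repo_loss(margin), simpo_loss(margin))
```

The contrast with the SimPO-style logistic loss illustrates the abstract's limiting-case claim: once β is removed, the smooth sigmoid weighting is replaced by a hard threshold at γ, so only pairs that are not yet separated by the target margin drive the gradient.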