Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Hyung Gyu Rho

2025-10-08

Summary

This paper introduces a new method called Margin-Adaptive Direct Preference Optimization, or MADPO, which improves how we train large language models to better follow human preferences.

What's the problem?

Currently, a popular technique called Direct Preference Optimization (DPO) uses a single setting, called a temperature, to control how much the model learns from different examples. This doesn't work well because some examples are easy for the model to get right, and it overlearns from those, while other examples are harder and it doesn't learn enough. Existing attempts to fix this either overcorrect, are unstable, or throw away potentially useful information.
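To make the fixed-temperature limitation concrete, here is a minimal sketch of the standard DPO loss. The function names and tensor layout are illustrative assumptions, not the paper's code; the key point is that one global `beta` scales every preference pair identically, whether the pair is easy or hard.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss with a single fixed temperature `beta`.

    Every preference pair in the batch is scaled by the same beta,
    regardless of how easy or hard the pair is -- the limitation
    described above.
    """
    # Log-ratios of the policy vs. the frozen reference model
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin, scaled by the one global temperature
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```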

What's the solution?

MADPO solves this in two steps. First, it trains a separate 'reward model' to estimate the preference margin of each training pair, that is, how much better the preferred response is than the rejected one. Then, it uses these margins to adjust how much the model learns from each individual example: hard pairs, where the two responses are close in quality, get more weight in the learning process, while easy pairs get less. Because this happens on a case-by-case basis rather than per batch, it is more precise and stable than previous methods, and it keeps every training example instead of filtering some out.
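The two-step idea above can be sketched as a per-sample re-weighting of the DPO loss. This is a minimal illustration under stated assumptions: `estimated_margins` stands in for the reward model's output from step one, and the particular sigmoid weighting function is a hypothetical choice that captures the described behavior (amplify hard pairs, dampen easy pairs), not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def madpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps,
               estimated_margins, beta=0.1):
    """Margin-adaptive re-weighting of the per-sample DPO loss.

    `estimated_margins` come from a separately trained reward model
    (step one of the two-step approach). The weighting function below
    is an illustrative assumption: it up-weights hard pairs (small
    margin) and down-weights easy pairs (large margin), giving a
    continuous, instance-level learning signal.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    per_sample_loss = -F.logsigmoid(logits)

    # Hypothetical continuous weight in (0, 2): larger when the
    # reward-model margin is small (hard pair), smaller when large.
    weights = 2.0 * torch.sigmoid(-estimated_margins)
    return (weights * per_sample_loss).mean()
```

Note that, unlike a filtering scheme, every pair keeps a nonzero weight, so no training signal is discarded.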

Why it matters?

This is important because it makes training these large language models more efficient and effective. MADPO consistently performs better than other methods, especially when the training data isn't perfect, leading to models that are better aligned with what people actually want and expect from them. It's a more reliable way to build AI that behaves as intended.

Abstract

Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing overfitting on easy examples and under-learning from informative ones. Recent methods have emerged to counter this. While IPO addresses general overfitting, its uniform regularization can be overly conservative. The more targeted approach of beta-DPO suffers from its own limitations: its batch-level adaptation applies a single, compromised temperature to mixed-margin pairs, its linear update rule can produce unstable negative beta values, and its filtering mechanism discards potentially useful training signals. In this work, we introduce Margin-Adaptive Direct Preference Optimization (MADPO), a method that provides a stable, data-preserving, and instance-level solution. MADPO employs a practical two-step approach: it first trains a reward model to estimate preference margins and then uses these margins to apply a continuous, adaptive weight to the DPO loss for each individual training sample. This re-weighting scheme creates an effective target margin that is amplified for hard pairs and dampened for easy pairs, allowing for granular control over the learning signal. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape and is robust to reward model estimation errors. We validate our theory with experiments on a sentiment generation task, where MADPO consistently and significantly outperforms strong baselines across datasets of varying quality. It achieves performance gains of up to +33.3% on High Quality data and +10.5% on Low Quality data over the next-best method. Our results establish MADPO as a more robust and principled approach to preference alignment.