
Accelerated Preference Optimization for Large Language Model Alignment

Jiafan He, Huizhuo Yuan, Quanquan Gu

2024-10-13


Summary

This paper introduces Accelerated Preference Optimization (APO), a method that speeds up the training used to align large language models (LLMs) with human preferences in the reinforcement learning from human feedback (RLHF) setting.

What's the problem?

Aligning LLMs with human preferences is crucial for making them useful and effective. Classic RLHF pipelines first estimate a reward model and then optimize the policy with proximal policy optimization (PPO), a two-step process that can be unstable and inefficient. Direct Preference Optimization (DPO) avoids explicit reward estimation, but its iterative variants can still converge slowly, which makes alignment training expensive.

What's the solution?

The authors propose APO, which applies Nesterov's momentum technique to speed up the alignment of LLMs. They first show that iterative preference optimization (such as iterative DPO) can be viewed as a proximal point method; this observation lets them add an extrapolation (momentum) step between iterations, yielding a general framework that unifies many existing preference optimization algorithms. Theoretically, APO converges faster than standard iterative methods such as DPO and Self-Play Preference Optimization (SPPO), and empirically it outperforms DPO, iterative DPO, and other strong RLHF baselines on the AlpacaEval 2.0 benchmark.
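To make the idea concrete, here is a minimal Python sketch of what an APO-style loop could look like, assuming the momentum step is applied directly to the model parameters between DPO rounds. The helper names (dpo_update, extrapolate), the coefficient alpha, and the toy policy are illustrative assumptions, not the paper's implementation; the paper defines the exact update and step sizes.

```python
# Illustrative sketch: momentum-accelerated iterative preference optimization.
# Assumption: the Nesterov-style extrapolation is applied to model parameters
# between DPO rounds; `dpo_update` is a stub standing in for real DPO training.
import copy
from typing import Dict

import torch
import torch.nn as nn


def dpo_update(policy: nn.Module, reference: nn.Module) -> nn.Module:
    """Stub for one round of DPO training of `policy` against a frozen `reference`.

    A real implementation would minimize the DPO loss on preference pairs;
    here we simply return a copy so the control flow is runnable.
    """
    return copy.deepcopy(policy)


def extrapolate(current: Dict[str, torch.Tensor],
                previous: Dict[str, torch.Tensor],
                alpha: float) -> Dict[str, torch.Tensor]:
    """Nesterov-style step: theta_t + alpha * (theta_t - theta_{t-1})."""
    return {k: v + alpha * (v - previous[k]) for k, v in current.items()}


def accelerated_preference_optimization(policy: nn.Module,
                                        num_rounds: int = 3,
                                        alpha: float = 0.3) -> nn.Module:
    prev_params = copy.deepcopy(policy.state_dict())
    for _ in range(num_rounds):
        reference = copy.deepcopy(policy)        # freeze current policy as the reference
        policy = dpo_update(policy, reference)   # standard iterative-DPO round
        new_params = copy.deepcopy(policy.state_dict())
        # Momentum: jump ahead along the direction of the latest improvement
        policy.load_state_dict(extrapolate(new_params, prev_params, alpha))
        prev_params = new_params
    return policy


if __name__ == "__main__":
    toy_policy = nn.Linear(8, 8)                 # stand-in for an LLM policy
    accelerated_preference_optimization(toy_policy)
```

The key design point the sketch tries to convey is that acceleration costs only one extra extrapolation per round: each iteration still runs an ordinary DPO round, and the momentum term reuses the previous iterate rather than requiring any additional preference data.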

Why it matters?

This research matters because it offers a more efficient way to align LLMs with human preferences. By making the alignment process faster and cheaper, APO could help produce models that follow human intent more reliably in conversation, content generation, and other applications where understanding user intent is essential.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a policy optimization problem without explicitly estimating the reward function. It overcomes the stability and efficiency issues of two-step approaches, which typically involve first estimating the reward function and then optimizing the policy via proximal policy optimization (PPO). Since RLHF is essentially an optimization problem, and it is well-known that momentum techniques can accelerate optimization both theoretically and empirically, a natural question arises: Can RLHF be accelerated by momentum? This paper answers this question in the affirmative. In detail, we first show that the iterative preference optimization method can be viewed as a proximal point method. Based on this observation, we propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms and employs Nesterov's momentum technique to speed up the alignment of LLMs. Theoretically, we demonstrate that APO can achieve a faster convergence rate than the standard iterative preference optimization methods, including DPO and Self-Play Preference Optimization (SPPO). Empirically, we show the superiority of APO over DPO, iterative DPO, and other strong baselines for RLHF on the AlpacaEval 2.0 benchmark.
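For reference, the standard objects behind these claims can be written down as follows. This is background notation only: the well-known DPO objective and a generic Nesterov-style extrapolation step, not the paper's exact APO recursion or its step-size schedule.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

$$
\theta_{t+1} \;=\; \arg\min_{\theta}\;\mathcal{L}_{\mathrm{DPO}}\!\left(\pi_\theta;\pi_{\tilde\theta_t}\right),
\qquad
\tilde\theta_{t+1} \;=\; \theta_{t+1} + \alpha\,(\theta_{t+1}-\theta_t),
$$

where $\sigma$ is the logistic function, $\beta$ is the KL-regularization strength, $(y_w, y_l)$ are the preferred and dispreferred responses, and $\alpha \in [0,1)$ is a momentum coefficient.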