WARP: On the Benefits of Weight Averaged Rewarded Policies
Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem
2024-06-25

Summary
This paper introduces Weight Averaged Rewarded Policies (WARP), a new method that improves how large language models (LLMs) learn from human feedback. It focuses on balancing the preservation of the model's pre-trained knowledge with the optimization of its behavior against a reward signal.
What's the problem?
When training LLMs with reinforcement learning from human feedback (RLHF), there is a tension between preserving the model's pre-trained knowledge and improving its behavior according to the reward. The standard approach adds a KL regularization term that keeps the policy close to its supervised fine-tuned initialization, but this constraint also limits how much reward the policy can ultimately obtain. The result is a trade-off between retaining prior knowledge and optimizing for higher reward.
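For context, the standard KL-regularized RLHF objective behind this trade-off can be written as follows (a generic textbook formulation in illustrative notation, where β controls the regularization strength; the paper's exact notation may differ):

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\; \beta \,\mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \right)
```

A larger β keeps the policy closer to the supervised fine-tuned model π_SFT at the cost of reward; WARP's first stage replaces this fixed anchor with an exponential moving average of the policy itself.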
What's the solution?
The authors introduce WARP to address this trade-off. WARP merges different versions of the policy's weights at three stages. First, during RL fine-tuning, it uses an exponential moving average (EMA) of the policy weights as a dynamic anchor for the KL regularization instead of a fixed one. Second, it merges several independently fine-tuned policies into a single stronger one using spherical interpolation of their weights. Finally, it linearly interpolates this merged model back towards its initialization to recover useful features from pre-training. The whole procedure is applied iteratively, with each iteration's result serving as the initialization for the next, progressively improving the balance between reward optimization and knowledge retention.
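A minimal NumPy sketch of the three weight-space operations described above (the function names, the per-tensor treatment of spherical interpolation, and all hyperparameter values are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def ema_update(anchor, policy, decay=0.99):
    """Stage 1: exponential moving average of the policy weights, used as the KL anchor."""
    return {k: decay * anchor[k] + (1 - decay) * policy[k] for k in anchor}

def slerp(w_a, w_b, t=0.5, eps=1e-8):
    """Stage 2: spherical interpolation between two fine-tuned policies, per weight tensor."""
    merged = {}
    for k in w_a:
        a, b = w_a[k].ravel(), w_b[k].ravel()
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
        omega = np.arccos(np.clip(cos, -1.0, 1.0))
        if omega < eps:  # nearly parallel tensors: fall back to linear interpolation
            merged[k] = (1 - t) * w_a[k] + t * w_b[k]
        else:
            merged[k] = (np.sin((1 - t) * omega) * w_a[k] + np.sin(t * omega) * w_b[k]) / np.sin(omega)
    return merged

def liti(w_init, w_merged, eta=0.3):
    """Stage 3: linear interpolation towards the initialization to recover pre-trained features."""
    return {k: (1 - eta) * w_merged[k] + eta * w_init[k] for k in w_init}
```

Here the weights are represented as dictionaries mapping parameter names to NumPy arrays; in practice these operations would be applied to full LLM checkpoints.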
Why it matters?
This research is significant because it provides a new way to enhance the training of language models, making them more effective at understanding and generating human-like responses. By improving how these models learn from feedback while retaining their pre-existing knowledge, WARP could lead to more reliable and capable AI systems in various applications.
Abstract
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
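To make the iterative procedure concrete, here is a hypothetical outer loop reusing the slerp and liti helpers sketched above; rl_finetune stands in for a KL-regularized RL fine-tuning run anchored to an EMA of its own weights (stage 1), and every name and hyperparameter here is an assumption for illustration rather than the paper's recipe:

```python
def warp(init_weights, rl_finetune, num_iterations=2, num_runs=2, eta=0.3):
    """Illustrative WARP-style outer loop: fine-tune several copies, merge them
    spherically, then interpolate back towards the current initialization."""
    current = init_weights
    for _ in range(num_iterations):
        # Stage 1 (EMA anchor) is assumed to happen inside each rl_finetune call.
        runs = [rl_finetune(current) for _ in range(num_runs)]
        # Stage 2: fold the independent runs into one model via repeated SLERP,
        # weighting each new run by 1/i to approximate a uniform merge.
        merged = runs[0]
        for i, run in enumerate(runs[1:], start=2):
            merged = slerp(merged, run, t=1.0 / i)
        # Stage 3: interpolate towards the initialization; the result seeds the next iteration.
        current = liti(current, merged, eta=eta)
    return current
```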