SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF
Atoosa Chegini, Hamid Kazemi, Iman Mirzadeh, Dong Yin, Maxwell Horton, Moin Nabi, Mehrdad Farajtabar, Keivan Alizadeh
2024-11-05

Summary
This paper introduces SALSA, a method for improving how large language models (LLMs) learn from human feedback. It swaps the usual fixed reference model in RLHF for a 'model soup', a weight-averaged blend of fine-tuned models, so the policy can explore more of the reward landscape while remaining stable.
What's the problem?
Standard RLHF training penalizes the policy for drifting too far, measured by KL divergence, from a frozen reference model: the initial supervised fine-tuned checkpoint. This penalty keeps training stable, but it also confines optimization to a narrow region around the starting point, limiting exploration of the reward landscape. The result is often suboptimal alignment with human values and preferences, making the models less effective in real-world applications.
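For readers who want the formal picture, the penalty described above is usually written as a KL-regularized objective of the following form (the notation below is our own shorthand, not taken from the paper):

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\Big[\, r(x, y) \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_{\theta}(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \Big]
```

Here r is the reward model, pi_ref is the frozen initial (SFT) policy, and beta sets how strongly the policy is pulled back toward it; a strong penalty is what keeps optimization near the starting checkpoint.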
What's the solution?
SALSA replaces the frozen reference model with a 'model soup': the weight-space average of two independently supervised fine-tuned (SFT) models. Because this averaged model sits in a better-positioned region of parameter space, the policy is allowed to deviate further in KL divergence, and therefore to explore a wider range of solutions, without sacrificing stability. The authors tested SALSA on several popular open LLMs and found that it consistently outperformed standard PPO, reaching higher rewards and better alignment with human preferences, along with improved robustness and out-of-distribution generalization.
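A minimal sketch of the weight-averaging step, assuming two SFT checkpoints of the same architecture; the function name make_soup, the checkpoint paths, and the interpolation weight alpha are illustrative, not from the paper:

```python
import torch
from transformers import AutoModelForCausalLM

def make_soup(path_a: str, path_b: str, alpha: float = 0.5):
    """Return a 'model soup': the weight-space average of two SFT checkpoints."""
    model_a = AutoModelForCausalLM.from_pretrained(path_a)
    model_b = AutoModelForCausalLM.from_pretrained(path_b)
    state_b = model_b.state_dict()

    soup_state = {}
    for name, param_a in model_a.state_dict().items():
        param_b = state_b[name]
        if torch.is_floating_point(param_a):
            # Interpolate matching parameters; both checkpoints must share
            # the same architecture for this averaging to be well defined.
            soup_state[name] = alpha * param_a + (1.0 - alpha) * param_b
        else:
            # Leave non-float buffers (e.g., integer position ids) untouched.
            soup_state[name] = param_a

    model_a.load_state_dict(soup_state)
    model_a.eval()  # the soup is kept frozen and used only as the reference
    return model_a
```

In SALSA this averaged model takes the place of the frozen reference inside the KL penalty; the PPO training loop itself is otherwise unchanged.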
Why it matters?
This research is important because it enhances the ability of AI systems to adapt to human feedback, making them more reliable and effective in understanding and responding to user needs. By improving how LLMs learn from human input, SALSA could lead to better AI assistants and other technologies that align closely with human values, ultimately benefiting various fields such as customer service, education, and entertainment.
Abstract
In Large Language Model (LLM) development, Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning models with human values and preferences. RLHF traditionally relies on the Kullback-Leibler (KL) divergence between the current policy and a frozen initial policy as a reference, which is added as a penalty in policy optimization algorithms like Proximal Policy Optimization (PPO). While this constraint prevents models from deviating too far from the initial checkpoint, it limits exploration of the reward landscape, reducing the model's ability to discover higher-quality solutions. As a result, policy optimization is often trapped in a narrow region of the parameter space, leading to suboptimal alignment and performance. This paper presents SALSA (Soup-based Alignment Learning for Stronger Adaptation), a novel approach designed to overcome these limitations by creating a more flexible and better-located reference model through weight-space averaging of two independent supervised fine-tuned (SFT) models. This model soup allows for larger deviation in KL divergence and exploration of a promising region of the solution space without sacrificing stability. By leveraging this more robust reference model, SALSA fosters better exploration, achieving higher rewards and improving model robustness, out-of-distribution generalization, and performance. We validate the effectiveness of SALSA through extensive experiments on popular open models (Llama2-7B, Mistral-7B, and Gemma-2B) across various benchmarks (MT-Bench, Arena-Hard, UltraFeedback), where it consistently surpasses PPO by fostering deeper exploration and achieving superior alignment in LLMs.
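Concretely, the only change relative to standard PPO-based RLHF is which policy anchors the KL penalty. In the same shorthand as above (a sketch, not the paper's exact formulation), with two SFT parameter sets theta_1 and theta_2:

```latex
\theta_{\mathrm{soup}} = \tfrac{1}{2}\,(\theta_1 + \theta_2), \qquad
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\Big[\, r(x, y) \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_{\theta}(\cdot \mid x)\,\big\|\,\pi_{\theta_{\mathrm{soup}}}(\cdot \mid x)\big) \Big]
```

The reward model and the PPO updates stay the same; only the anchor of the KL term moves from the single SFT checkpoint to the weight-averaged soup.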