Weighted-Reward Preference Optimization for Implicit Model Fusion
Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan
2024-12-05
Summary
This paper introduces Weighted-Reward Preference Optimization (WRPO), a new method for combining different large language models (LLMs) to enhance their performance without the complex alignment and merging procedures that explicit fusion methods require.
What's the problem?
When trying to fuse various LLMs, which can differ in architecture and size, existing methods must align the models' vocabularies and merge their output distribution matrices. These procedures are complicated and prone to introducing noise and errors, making it difficult to effectively combine the strengths of each model.
What's the solution?
WRPO addresses these challenges with an implicit fusion method that optimizes preferences between responses from the source LLMs and the target LLM, eliminating the need for vocabulary alignment and complex matrix merging. To handle the distribution gap between the models, it uses a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target model to the source models, allowing for a smoother transfer of their capabilities. On several benchmarks, this approach has been shown to outperform existing fusion methods.
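The weighted-reward idea can be sketched as a DPO-style preference loss whose preferred-response reward interpolates between a completion from a source LLM and one from the target LLM, with the interpolation weight ramping toward the source over training. This is a minimal illustrative sketch: the linear schedule, the function names, and the scalar log-probability inputs are our assumptions, not the paper's exact implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def wrpo_loss(logp_src_w, logp_ref_src_w,
              logp_tgt_w, logp_ref_tgt_w,
              logp_l, logp_ref_l,
              alpha, beta=0.1):
    """Illustrative weighted-reward preference loss.

    Each log-probability pair (policy vs. reference) yields a
    DPO-style implicit reward beta * (logp - logp_ref). The
    preferred-side reward blends the source-LLM response (weight
    alpha) with the target-LLM response (weight 1 - alpha).
    """
    r_src = beta * (logp_src_w - logp_ref_src_w)  # reward, source-preferred response
    r_tgt = beta * (logp_tgt_w - logp_ref_tgt_w)  # reward, target-preferred response
    r_l = beta * (logp_l - logp_ref_l)            # reward, dispreferred response
    margin = alpha * r_src + (1.0 - alpha) * r_tgt - r_l
    return -math.log(sigmoid(margin))

def alpha_schedule(step, total_steps):
    """Progressive adaptation: start fully on the target model
    (alpha = 0) and shift linearly toward the source models
    (alpha = 1) -- a linear ramp is an assumption here."""
    return min(1.0, step / total_steps)
```

With identical policy and reference log-probabilities the margin is zero and the loss equals log 2, as in standard DPO; as the blended preferred reward rises above the dispreferred one, the loss shrinks toward zero. In practice these quantities would be per-sequence log-probabilities from the policy and a frozen reference model, averaged over a batch.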
Why it matters?
This research is important because it simplifies the process of combining different AI models, making it easier to create more powerful and versatile language models. By improving how these models work together, WRPO can lead to better performance in tasks like natural language understanding and generation, which are essential for applications in AI-driven communication, content creation, and more.
Abstract
While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs. Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard. Our code is available at https://github.com/SLIT-AI/WRPO.