If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs
Muhammad Khalifa, Yi-Chern Tan, Arash Ahmadian, Tom Hosking, Honglak Lee, Lu Wang, Ahmet Üstün, Tom Sherborne, Matthias Gallé
2024-12-10

Summary
This paper introduces a method for merging multiple AI models into a single, more capable model, improving performance across a range of tasks without training a new model from scratch.
What's the problem?
When developing AI models, especially large language models (LLMs), researchers often create many different versions that are trained on various tasks. However, some of these models may not perform well enough on their own, and discarding them wastes valuable resources. Additionally, merging 'generalist' models that are designed for a wide range of tasks can be tricky because it's unclear if combining them will actually improve performance.
What's the solution?
The authors propose a new approach to model merging that focuses on recycling existing model checkpoints (snapshots of a model's parameters from different training runs and stages). They developed an optimization algorithm that combines these checkpoints as a weighted linear combination, tuning how much each checkpoint contributes to the final model so as to maximize performance across tasks. The result is a 'Pareto-optimal' merged model that outperforms both any individual checkpoint and simpler merging baselines, with their experiments showing performance improvements of up to 10% and a more efficient merging process.
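To make the core idea concrete, here is a minimal sketch of merging checkpoints as a weighted linear combination of their parameters. The merge_checkpoints helper and the weight-normalization convention are assumptions for illustration, not the authors' implementation.

```python
import torch

def merge_checkpoints(state_dicts, weights):
    """Merge several checkpoints by linearly combining their parameters.

    state_dicts: list of dicts mapping parameter names to tensors; all
                 checkpoints are assumed to share the same architecture.
    weights: one scalar per checkpoint; normalized here to sum to 1,
             which is one common convention for linear merging.
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    merged = {}
    for name, ref in state_dicts[0].items():
        merged[name] = sum(w * sd[name].to(ref.dtype) for w, sd in zip(norm, state_dicts))
    return merged

# Toy usage: two "checkpoints" of a single linear layer.
ckpt_a = {"linear.weight": torch.ones(2, 2)}
ckpt_b = {"linear.weight": torch.zeros(2, 2)}
merged = merge_checkpoints([ckpt_a, ckpt_b], weights=[0.7, 0.3])
print(merged["linear.weight"])  # every entry is 0.7
```

Normalizing the weights keeps the merged parameters on the same scale as the originals; the paper's contribution is in how these per-checkpoint weights are optimized rather than set by hand.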
Why it matters?
This research is important because it provides a way to make better use of existing AI models, saving time and resources while enhancing their capabilities. By improving how models are merged, this approach can lead to more powerful and versatile AI systems that can be applied in fields such as natural language processing and computer vision.
Abstract
Model merging has shown great promise at combining expert models, but the benefit of merging is unclear when merging 'generalist' models trained on many tasks. We explore merging in the context of large (~100B parameter) models, by recycling checkpoints that exhibit tradeoffs among different tasks. Such checkpoints are often created in the process of developing a frontier model, and many suboptimal ones are usually discarded. Given a pool of model checkpoints obtained from different training runs (e.g., different stages, objectives, hyperparameters, and data mixtures), which naturally show tradeoffs across different language capabilities (e.g., instruction following vs. code generation), we investigate whether merging can recycle such suboptimal models into a Pareto-optimal one. Our optimization algorithm tunes the weight of each checkpoint in a linear combination, resulting in a Pareto-optimal model that outperforms both individual models and merge-based baselines. Further analysis shows that good merges tend to include almost all checkpoints with non-zero weights, indicating that even seemingly bad initial checkpoints can contribute to good final merges.
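The abstract describes searching for checkpoint weights whose merge is Pareto-optimal across several capabilities. As a rough illustration of that objective only (not the paper's actual optimizer), the sketch below pairs a simple random search over weights with a Pareto-front filter; the evaluate_merge function is a hypothetical stand-in for scoring a merged model on a set of task benchmarks.

```python
import random

def dominates(scores_a, scores_b):
    """True if scores_a is at least as good on every task and strictly better on one."""
    return (all(a >= b for a, b in zip(scores_a, scores_b))
            and any(a > b for a, b in zip(scores_a, scores_b)))

def pareto_front(candidates):
    """Keep only candidates whose task scores are not dominated by another candidate."""
    return [c for c in candidates
            if not any(dominates(o[1], c[1]) for o in candidates if o is not c)]

def search_merge_weights(num_checkpoints, evaluate_merge, trials=200, seed=0):
    """Random search over per-checkpoint merge weights; return the Pareto-optimal ones."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(trials):
        weights = [rng.random() for _ in range(num_checkpoints)]
        candidates.append((weights, evaluate_merge(weights)))
    return pareto_front(candidates)

# Hypothetical stand-in: pretend two checkpoints trade off two capabilities,
# e.g. instruction following vs. code generation.
def evaluate_merge(weights):
    total = sum(weights) or 1.0
    w = [x / total for x in weights]
    instruction_score = 0.8 * w[0] + 0.3 * w[1]
    code_score = 0.2 * w[0] + 0.9 * w[1]
    return (instruction_score, code_score)

best = search_merge_weights(num_checkpoints=2, evaluate_merge=evaluate_merge)
print(len(best), "Pareto-optimal weightings found")
```

In the paper, the weights come from an optimization procedure rather than random sampling, and the scores come from real task suites; the sketch only shows the shape of the objective: evaluate many weightings and keep those that are not dominated on any capability.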