What Matters for Model Merging at Scale?
Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, Tsendsuren Munkhdalai
2024-10-08

Summary
This paper discusses model merging, the process of combining multiple specialized expert models into a single, more capable model to improve performance and reduce storage and serving costs. The authors explore how different factors affect the success of merging, especially as model size scales up.
What's the problem?
While merging models can offer benefits like improved efficiency and better performance, most previous research has focused on merging small models. This leaves many questions unanswered about how larger models behave when merged and how factors like the quality of the base models and the number of expert models influence the merged model's performance. Without understanding these dynamics, it’s hard to know how to effectively combine models at a larger scale.
What's the solution?
To investigate these issues, the authors conducted experiments using four popular merging methods (Averaging, Task Arithmetic, DARE, and TIES) on models ranging from 1 billion to 64 billion parameters. They merged up to eight different expert models and evaluated performance on both the experts' training tasks and new, unseen tasks. Their findings revealed several important insights: merging works better when starting with strong base models, larger models are easier to merge, and merging generally improves the model's ability to generalize to new tasks. They also found that the different merging methods performed very similarly at larger model scales.
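To make two of the evaluated methods concrete, here is a minimal sketch (not the paper's actual code) of Averaging and Task Arithmetic. Model weights are simplified to `{name: float}` dicts; in practice each entry would be a per-layer tensor, and the `alpha` scaling factor is an assumed illustrative hyperparameter.

```python
def average_merge(expert_weights):
    """Averaging: each merged weight is the mean of the experts' weights."""
    n = len(expert_weights)
    keys = expert_weights[0].keys()
    return {k: sum(w[k] for w in expert_weights) / n for k in keys}

def task_arithmetic_merge(base, expert_weights, alpha=1.0):
    """Task Arithmetic: sum the experts' "task vectors" (expert minus base)
    and add them, scaled by alpha, back onto the base model's weights."""
    merged = {}
    for k, b in base.items():
        task_vector_sum = sum(w[k] - b for w in expert_weights)
        merged[k] = b + alpha * task_vector_sum
    return merged

# Toy example: one base model and two "experts" fine-tuned from it.
base = {"layer.w": 1.0}
expert_a = {"layer.w": 1.4}   # task vector: +0.4
expert_b = {"layer.w": 0.8}   # task vector: -0.2

print(average_merge([expert_a, expert_b]))
print(task_arithmetic_merge(base, [expert_a, expert_b], alpha=0.5))
```

DARE and TIES build on the same task-vector idea but additionally sparsify the task vectors and resolve sign conflicts between experts before combining them.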
Why it matters?
This research is significant because it provides valuable insights into how to effectively merge large language models, which is crucial for advancing AI technology. By understanding what factors contribute to successful model merging, researchers and developers can create more powerful AI systems that leverage the strengths of multiple specialized models, making them more efficient and capable in real-world applications.
Abstract
Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors -- like the base model quality and the number of expert models -- to affect the merged model's performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods -- Averaging, Task Arithmetic, DARE, and TIES -- across model sizes ranging from 1B-64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts' training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better compared to the multitask trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.