Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

Shalini Maiti, Amar Budhiraja, Bhavul Gauri, Gaurav Chaurasia, Anton Protopopov, Alexis Audran-Reiss, Michael Slater, Despoina Magka, Tatiana Shavrina, Roberta Raileanu, Yoram Bachrach

2025-11-18

Summary

This paper explores a way to improve large language models, like the ones powering chatbots, without needing to spend a huge amount of time and money retraining them from scratch.

What's the problem?

Training these powerful language models is incredibly expensive and takes a lot of computing resources. While simply combining existing models (called 'model souping') can help, previous methods treated all models as equally valuable, which isn't always the case. Different models tend to excel at different types of tasks, and a simple average doesn't take that into account.

What's the solution?

The researchers developed a new method called 'Soup Of Category Experts', or SoCE. Instead of averaging all models equally, SoCE identifies which models are best at specific categories of tasks, like math, working across languages, or using tools, and then gives those 'expert' models more weight when combining them. This is done by analyzing how models perform across benchmark categories, grouping together categories whose scores move in tandem (so that different groups are only weakly correlated), picking the best model for each group, and then searching for the weighted average that maximizes overall performance.
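
At its core, the combination step is just a weighted average of model parameters. Below is a minimal sketch of that step, assuming all candidate checkpoints share the same architecture so their parameter tensors line up; the file names and weight values are illustrative placeholders, not details from the paper.

```python
# Hypothetical sketch of non-uniform model souping: a weighted average of
# parameter tensors across same-architecture checkpoints.
import torch

def weighted_soup(state_dicts, weights):
    """Combine state dicts with per-model weights (normalized to sum to 1)."""
    total = sum(weights)
    weights = [w / total for w in weights]
    souped = {}
    for name in state_dicts[0]:
        # Sum w_i * theta_i over models; assumes every dict has identical keys/shapes.
        souped[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return souped

# Illustrative usage: three category experts (math, multilingual, tool calling),
# with higher weight on the expert for the category that matters most.
paths = ["expert_math.pt", "expert_multilingual.pt", "expert_tools.pt"]
state_dicts = [torch.load(p, map_location="cpu") for p in paths]
souped = weighted_soup(state_dicts, weights=[0.5, 0.3, 0.2])
torch.save(souped, "souped_model.pt")
```

Uniform souping is the special case where every weight is 1/N; SoCE's contribution is choosing which checkpoints enter the soup and finding better non-uniform weights.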

Why it matters?

This research is important because it offers a more efficient way to improve language models. By intelligently combining existing models, we can achieve better performance and reliability without the massive cost of full retraining, making these powerful technologies more accessible and practical for a wider range of applications.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping, the practice of averaging weights from multiple models of the same architecture, has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math, and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.
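
To make the clustering idea from the abstract concrete, here is a small illustration (not the paper's implementation): correlate per-category scores across candidate models, merge categories that move together, and pick the strongest model as the 'expert' for each resulting cluster. The score matrix, the greedy merging rule, and the 0.9 threshold are all invented for this example.

```python
# Illustrative sketch: group benchmark categories by how correlated model
# scores are on them, then name an expert model per cluster.
import numpy as np

# scores[i, j] = score of model i on benchmark category j (made-up numbers)
scores = np.array([
    [0.82, 0.80, 0.41, 0.39],   # model A: strong on categories 0-1
    [0.45, 0.47, 0.78, 0.81],   # model B: strong on categories 2-3
    [0.60, 0.58, 0.55, 0.57],   # model C: balanced
])

corr = np.corrcoef(scores.T)  # category-by-category correlation matrix

# Greedily merge categories whose pairwise correlation exceeds a threshold;
# clusters that remain separate are only weakly correlated with each other.
threshold = 0.9
clusters = []
for j in range(corr.shape[0]):
    for cluster in clusters:
        if all(corr[j, k] > threshold for k in cluster):
            cluster.append(j)
            break
    else:
        clusters.append([j])

# The "expert" for a cluster is the model with the best mean score on it.
for cluster in clusters:
    expert = scores[:, cluster].mean(axis=1).argmax()
    print(f"categories {cluster}: expert model index {expert}")
```

On this toy data the loop recovers two clusters (categories 0-1 and 2-3) with models A and B as their respective experts; those experts are then the natural candidates for the weighted soup sketched earlier.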