
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

Wenzhe Li, Yong Lin, Mengzhou Xia, Chi Jin

2025-02-05

Summary

This paper examines whether combining the outputs of different large language models (LLMs) actually improves performance. It introduces Self-MoA, a method that aggregates repeated outputs from only the single best-performing model, and shows that this approach beats mixing multiple different models in many cases.

What's the problem?

Mixing the outputs of different LLMs, a method known as Mixture-of-Agents (MoA), is commonly used to improve results. However, this approach can backfire: when the models in the mix vary in quality, the weaker ones lower the average quality of the candidate outputs, which can hurt the final aggregated answer.

What's the solution?

The researchers created Self-MoA, a method that samples several outputs from the single best-performing model and aggregates them, instead of mixing outputs from several different models (a minimal sketch of the idea appears below). Tested across various benchmarks, it consistently outperformed the standard MoA approach. The authors also identified the conditions under which mixing different models can still help, and introduced a sequential version of Self-MoA that aggregates many outputs efficiently over multiple rounds.
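
To make the idea concrete, here is a minimal Python sketch of the Self-MoA loop (not the authors' code): generate is a hypothetical stand-in for whatever inference API serves the single top-performing model, and the sample count and prompt wording are illustrative assumptions.

# Minimal sketch of the Self-MoA idea (illustrative, not the paper's code).
def generate(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for a call to the single top-performing model."""
    raise NotImplementedError("plug in your own LLM client here")


def self_moa(task: str, num_samples: int = 6) -> str:
    # 1) Sample several candidate answers from the SAME best model; variety
    #    comes from sampling temperature rather than from model diversity.
    candidates = [generate(task, temperature=0.7) for _ in range(num_samples)]

    # 2) Ask that same model to synthesize the candidates into one answer.
    numbered = "\n\n".join(
        f"[Response {i + 1}]\n{c}" for i, c in enumerate(candidates)
    )
    aggregate_prompt = (
        "You are given several candidate responses to the task below. "
        "Synthesize them into a single, higher-quality response.\n\n"
        f"Task: {task}\n\n{numbered}"
    )
    return generate(aggregate_prompt, temperature=0.0)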

Why it matters?

This research matters because it challenges the assumption that mixing multiple models is always better. By showing that repeatedly querying and aggregating the single best model can yield stronger results, it simplifies how LLM ensembles are built and helps achieve state-of-the-art performance on tasks such as reasoning and problem-solving.

Abstract

Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA -- an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of 3.8% improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves the new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that the MoA performance is rather sensitive to the quality, and mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA, that is capable of aggregating a large number of LLM outputs on-the-fly over multiple rounds, and is as effective as aggregating all outputs at once.
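
The sequential variant mentioned in the abstract can be pictured as the same aggregation loop run round by round. The sketch below is an assumed structure, not the paper's implementation: it reuses the hypothetical generate stub from the earlier sketch, and carries each round's aggregate forward as one extra candidate so that a large number of outputs can be folded in without keeping them all in a single prompt.

# Rough sketch of round-by-round ("sequential") aggregation; assumed structure,
# not the authors' implementation. Reuses the hypothetical generate() stub
# from the sketch under "What's the solution?".
def sequential_self_moa(task: str, total_samples: int = 12, window: int = 4) -> str:
    running_aggregate = None
    for _ in range(0, total_samples, window):
        # Draw a small batch of fresh samples from the same top model.
        batch = [generate(task, temperature=0.7) for _ in range(window)]
        # Carry the previous round's aggregate forward as one more candidate,
        # so earlier outputs are summarized on-the-fly rather than all stored.
        candidates = batch + ([running_aggregate] if running_aggregate else [])
        numbered = "\n\n".join(
            f"[Response {i + 1}]\n{c}" for i, c in enumerate(candidates)
        )
        prompt = (
            "Synthesize the candidate responses below into a single, "
            f"higher-quality response.\n\nTask: {task}\n\n{numbered}"
        )
        running_aggregate = generate(prompt, temperature=0.0)
    return running_aggregate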