When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
Heecheol Yun, Kwangmin Ki, Junghyun Lee, Eunho Yang
2025-10-21
Summary
This paper explores how to best combine the strengths of multiple large language models (LLMs) to generate better, longer pieces of text.
What's the problem?
While combining LLMs works well for short answers, the benefit doesn't automatically carry over to long-form content like essays or stories. Simply averaging the models' predictions at *every* step often makes the output worse. The issue is that different models sometimes disagree on what the next word should be, and constantly trying to resolve those disagreements can lead to unstable, lower-quality text. This disagreement stems from how each model breaks words into smaller pieces (tokenization) and from how confident each model is in its predictions.
What's the solution?
The researchers developed a method called SAFE, which stands for Stable And Fast LLM Ensembling. Instead of combining predictions at every single token, SAFE *selectively* combines them only when it makes sense. It decides this by looking at two things: whether the models break the current word into pieces differently (tokenization mismatch) and how strongly they agree on what the next token should be (consensus). They also added a probability-sharpening technique that gathers the probability a model spreads across the sub-word pieces of a word onto a single representative token, making the process more stable. Essentially, SAFE only asks for help from other models when it needs it, and it does so in a smart way.
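The selection logic can be sketched in a few lines. This is a minimal illustration, not the paper's exact criteria: representing each model's next-token distribution as a string-to-probability dict, using a prefix test as the tokenization-mismatch check, and using probability-mass overlap against a fixed threshold as the consensus measure are all simplifying assumptions.

```python
# Illustrative sketch of selective token-level ensembling in the spirit of
# SAFE. Distributions are dicts mapping decoded token strings to
# probabilities; the prefix test and the overlap threshold below are
# simplifying assumptions, not the paper's exact formulation.

def should_ensemble(p_main, p_assist, consensus_thresh=0.8):
    """Return True if this decoding step is a good point to ensemble."""
    top_main = max(p_main, key=p_main.get)
    top_assist = max(p_assist, key=p_assist.get)
    # 1) Tokenization-mismatch check: if one model's top token is a strict
    #    prefix of the other's, the models are segmenting the same word
    #    differently, and averaging mismatched units is unstable -- skip.
    if top_main != top_assist and (
        top_main.startswith(top_assist) or top_assist.startswith(top_main)
    ):
        return False
    # 2) Consensus check: overlap of probability mass on shared strings.
    #    When the models already agree, ensembling adds cost without
    #    benefit, so ensemble only at low-consensus positions.
    overlap = sum(
        min(p_main.get(t, 0.0), p_assist.get(t, 0.0))
        for t in set(p_main) | set(p_assist)
    )
    return overlap < consensus_thresh

def ensemble_step(p_main, p_assist):
    """Average the two distributions over the union of their tokens."""
    tokens = set(p_main) | set(p_assist)
    mixed = {t: 0.5 * (p_main.get(t, 0.0) + p_assist.get(t, 0.0)) for t in tokens}
    z = sum(mixed.values())
    return {t: v / z for t, v in mixed.items()}

# The models disagree, and neither top token is a sub-word of the other,
# so this step qualifies for ensembling.
p_main = {"cat": 0.6, "dog": 0.4}
p_assist = {"cat": 0.5, "bird": 0.5}
if should_ensemble(p_main, p_assist):
    p_next = ensemble_step(p_main, p_assist)
else:
    p_next = p_main  # fall back to the main model alone
```

Because most decoding steps fail one of the two checks, the expensive assistant-model call is needed only rarely, which is how this style of selection can stay fast.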
Why it matters?
This research is important because it shows us how to effectively use multiple LLMs together to create high-quality, long-form content. By being strategic about *when* and *how* to combine models, we can get better results than using any single model alone, and do so efficiently by only ensembling a small percentage of the generated tokens. This could lead to significant improvements in areas like automated writing, content creation, and even more advanced AI assistants.
Abstract
Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models' next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining these positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we introduce a probability sharpening strategy that consolidates probabilities spread across multiple sub-word tokens representing the same word into a single representative token. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
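The probability-sharpening idea can also be illustrated with a small sketch. Grouping tokens by string prefix and folding mass onto the longest matching token are simplifying assumptions made here for illustration; the paper's mechanism consolidates probabilities of sub-word tokens representing the same word onto a single representative token.

```python
# Illustrative sketch of probability sharpening: mass that a model spreads
# across sub-word tokens beginning the same word is consolidated onto one
# representative token (here, the longest completion present). Grouping by
# string prefix is a simplifying assumption for illustration.

def sharpen(p):
    """Move mass from strict-prefix tokens onto their longest completion
    present in the distribution."""
    longest_first = sorted(p, key=len, reverse=True)
    out = dict(p)
    for short in sorted(p, key=len):  # fold shortest prefixes first
        for long_tok in longest_first:
            if long_tok != short and long_tok.startswith(short):
                # Fold the prefix token's mass into its completion.
                out[long_tok] = out.get(long_tok, 0.0) + out.pop(short, 0.0)
                break
    return out

# Folds the mass on the prefixes "pro" and "prob" onto the full word
# "probability", so the ensemble compares whole words, not partial pieces.
sharpen({"pro": 0.1, "prob": 0.2, "probability": 0.4, "dog": 0.3})
```

Concentrating a word's mass on one token keeps the distributions comparable across models, which is what makes the averaging step in the ensemble stable.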