Token-Level LLM Collaboration via FusionRoute

Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

2026-01-09

Summary

This paper introduces FusionRoute, a new way to combine the strengths of large and small AI models to get better performance without the huge cost of training and running one massive model.

What's the problem?

Large AI models are good at many things, but training and deploying them is incredibly expensive. Smaller, specialized models are cheaper, but they only do well on the tasks they were trained for and struggle with anything new. That creates a trade-off between capability and cost. Existing methods for combining models don't fully solve it either: if a router can only pick among the experts' fixed outputs, it is limited to whatever those experts can already produce, so in general it cannot reach the best possible decoding behavior.

What's the solution?

FusionRoute works by having a lightweight 'router' that, at each step of generating text, decides which AI model ('expert') is best suited for the job. But it doesn't just pass along the chosen expert's answer: the router also produces its own complementary logits, which are added to the expert's next-token scores to refine or correct its prediction. The researchers proved that simply picking an expert isn't always the best strategy, and that adding this refinement step significantly improves results. They tested the approach with different model families on tasks like math, coding, and instruction following; a rough sketch of one such decoding step is shown below.
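Here is a minimal sketch (not the authors' released code) of what a routed decoding step with complementary-logit fusion could look like. The tiny linear "experts", the Router class, and every name here are illustrative stand-ins for this example; the real system routes between full LLMs and trains the router on actual data.

```python
# Minimal sketch of token-level routing with a complementary logit.
# All modules and names below are illustrative stand-ins, not FusionRoute's actual API.
import torch

vocab_size, hidden_dim, num_experts = 100, 32, 2

# Stand-in "experts": tiny linear heads mapping a hidden state to next-token logits.
experts = [torch.nn.Linear(hidden_dim, vocab_size) for _ in range(num_experts)]

class Router(torch.nn.Module):
    """Lightweight router: from the same hidden state it produces
    (i) a score per expert and (ii) complementary logits over the vocabulary."""
    def __init__(self):
        super().__init__()
        self.select = torch.nn.Linear(hidden_dim, num_experts)      # which expert to use
        self.complement = torch.nn.Linear(hidden_dim, vocab_size)   # refinement logits

    def forward(self, h):
        return self.select(h), self.complement(h)

router = Router()

def decode_step(h):
    """One decoding step: pick an expert, then fuse its logits with the router's."""
    scores, comp_logits = router(h)
    expert_id = scores.argmax(dim=-1).item()        # hard selection of one expert
    expert_logits = experts[expert_id](h)           # frozen expert's next-token logits
    fused_logits = expert_logits + comp_logits      # logit addition refines/corrects
    probs = torch.softmax(fused_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1), expert_id

# Example: one step from a dummy hidden state.
h = torch.randn(1, hidden_dim)
next_token, chosen_expert = decode_step(h)
print(f"expert {chosen_expert} chosen, sampled token id {next_token.item()}")
```

The key point is the final addition: because the router contributes its own logits rather than only choosing among the experts, the fused distribution can differ from anything any single expert would produce on its own.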

Why it matters?

This research is important because it offers a practical way to build powerful AI systems that are more affordable and adaptable. Instead of needing one giant model, you can use a team of smaller, specialized models working together, guided by FusionRoute, to achieve strong results across a variety of tasks while staying competitive with models specifically designed for those tasks.

Abstract

Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.