RelayLLM: Efficient Reasoning via Collaborative Decoding
Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang
2026-01-09
Summary
This paper introduces a new way for smaller, faster AI models and larger, more powerful AI models to work together to solve complex problems, focusing on making the process more efficient.
What's the problem?
Typically, when you want an AI to do something complicated, you need a really big and powerful model, but those are slow and expensive to run. Smaller models are faster and cheaper, but they often can't handle the tricky parts of the problem. Existing methods that try to combine them just hand the whole problem off to the big model when the small one gets stuck, which wastes resources because the small model *can* often handle most of the work.
What's the solution?
The researchers created a system called RelayLLM where the smaller model stays in control and actively decides when it needs help from the larger model. Instead of passing the entire problem, it only asks for help with specific parts – individual 'tokens' or pieces of the answer – using a special 'command'. They also developed a training process to teach the smaller model when to ask for help and how to balance working independently with seeking assistance.
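To make the idea concrete, here is a minimal sketch of what token-level "relay" decoding could look like. The function names (slm_next_token, llm_next_token) and the CALL_LLM marker are hypothetical stand-ins for whatever interfaces and special command token the paper actually uses; this is an illustration of the control flow, not the authors' implementation.

```python
# Minimal sketch of token-level relay decoding (illustrative, not the paper's code).
# The SLM drives generation; when it emits the special CALL_LLM command token,
# the large model supplies just the next token, then control returns to the SLM.

CALL_LLM = "<call_llm>"  # assumed special command token emitted by the SLM
EOS = "<eos>"            # assumed end-of-sequence token

def relay_decode(prompt, slm_next_token, llm_next_token, max_tokens=512):
    """Generate a response with the SLM in control, relaying
    individual critical tokens to the LLM on demand."""
    tokens = []
    llm_calls = 0
    for _ in range(max_tokens):
        tok = slm_next_token(prompt, tokens)
        if tok == CALL_LLM:
            # Relay: ask the large model for a single critical token.
            tok = llm_next_token(prompt, tokens)
            llm_calls += 1
        tokens.append(tok)
        if tok == EOS:
            break
    return tokens, llm_calls
```

Because the hand-off happens one token at a time, the expensive model is only billed for the few tokens it actually produces, rather than for re-solving the whole problem.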
Why it matters?
This is important because RelayLLM significantly reduces the cost of using large AI models. It achieves nearly the same level of accuracy as simply using the large model all the time, while calling on the large model for only about 1% of the generated tokens, resulting in huge cost savings and making complex AI tasks more accessible.
Abstract
Deploying Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, comprising a warm-up stage followed by Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
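The abstract does not spell out the GRPO reward, so the sketch below is only one plausible way to score "independence vs. strategic help-seeking": reward correct answers and penalize the fraction of tokens relayed to the LLM, then normalize rewards within each sampled group as GRPO does. The reward terms and the cost_weight parameter are assumptions for illustration and may differ from the paper's actual design.

```python
# Illustrative GRPO-style objective for relay decoding (assumed, not the paper's exact reward).

def relay_reward(is_correct, llm_tokens, total_tokens, cost_weight=0.5):
    """Reward answer correctness, penalize the fraction of tokens
    that had to be relayed to the large model."""
    call_ratio = llm_tokens / max(total_tokens, 1)
    return (1.0 if is_correct else 0.0) - cost_weight * call_ratio

def group_relative_advantages(rewards):
    """GRPO computes each completion's advantage relative to the
    mean and standard deviation of its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0.0:
        std = 1.0  # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]
```

Under a reward of this shape, completions that stay accurate while invoking the LLM rarely receive the highest relative advantages, which is the behavior the training stage is meant to encourage.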