GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu

2026-01-13

Summary

This paper tackles how to make powerful 'Large Reasoning Models' (LRMs) faster and cheaper to use. These models excel at complex problem-solving by breaking tasks down into explicit steps, but generating all of those steps takes a lot of time and computing power.

What's the problem?

Currently, when using these step-by-step reasoning models, every step requires the full power of the large model, which is slow and expensive. Existing attempts to share the workload between smaller, faster models and larger, more powerful ones struggle to decide *when* a step actually needs the big model's help. They either analyze the tokens already generated (which adds latency) or verify the answer after the step is complete (which also adds latency and catches errors too late to avoid wasted computation).

What's the solution?

The researchers noticed that the very first token (roughly, the first word) generated in a reasoning step can reveal how difficult that step will be. Inspired by the 'Aha Moment' phenomenon, they observe that if the model is unsure right from the start (indicated by high entropy, i.e. randomness, in the first token's probability distribution), the step likely needs the larger model. They built a system called 'GlimpRouter' that uses a small model to generate just the first token of each step. If that first token's entropy exceeds a threshold, the step is handed to the large model; otherwise, the small model continues. This requires no extra training!
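The routing rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the threshold value are hypothetical, and a real system would read the probability distribution from the small model's logits for the first token of each step.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route_step(first_token_probs, threshold=1.0):
    """Decide which model should handle a reasoning step.

    If the small model's distribution over the step's *first* token is
    high-entropy (uncertain), escalate to the large model; otherwise let
    the small model finish the step. The threshold here is illustrative,
    not a value from the paper.
    """
    if token_entropy(first_token_probs) > threshold:
        return "large"
    return "small"

# A peaked distribution (the small model is confident) stays local:
print(route_step([0.9, 0.05, 0.05]))        # -> small
# A flat distribution (the small model is unsure) escalates:
print(route_step([0.25, 0.25, 0.25, 0.25]))  # -> large
```

The key efficiency point is that only one token is generated before deciding, so the routing check costs a single forward pass of the small model rather than a full generation or a post-hoc verification.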

Why it matters?

This is important because it offers a simple and effective way to cut both the cost and the latency of using these powerful reasoning models. By quickly identifying difficult steps and calling the large model only when necessary, GlimpRouter improved accuracy by 10.7% while reducing inference latency by 25.9% compared to the standalone large model on AIME25, suggesting a smarter way to allocate computing resources for complex problem-solving.

Abstract

Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.