LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Beomseok Kang, Jiwon Song, Jae-Joon Kim

2025-10-17


Summary

This paper focuses on making complex problem-solving with small language models faster without significantly sacrificing accuracy.

What's the problem?

When you break down a difficult question into smaller steps for a language model to solve, it takes more time overall. Existing methods to speed things up, like skipping some of the model's internal layers, don't work well in this situation because some steps need more layers than others, and the model often keeps generating text even after it's confident it has the answer.

What's the solution?

The researchers developed a system called LiteStage. It first runs an offline search to decide how many layers the model should use for each step of the problem. Then, while the model is generating, it uses the model's confidence to stop decoding once it's sure of the answer. Combining planning ahead of time with early stopping at run time makes the process more efficient.
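The early-stopping half of this idea can be illustrated with a toy decoding loop. This is a minimal sketch, not the authors' implementation: the `step` interface, the `confident_for` rule (stop when the last few tokens all reach a probability threshold), and the scripted "model" are all assumptions made for illustration, since this summary does not specify the actual exit criterion.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: given the tokens so far, return (next_token, probability).
StepFn = Callable[[Sequence[str]], Tuple[str, float]]


def decode_with_early_exit(
    step: StepFn,
    prompt: Sequence[str],
    stop_rule: Callable[[List[float]], bool],
    max_new_tokens: int = 64,
    eos: str = "<eos>",
) -> List[str]:
    """Decode token by token; stop at EOS, the token cap, or when stop_rule fires."""
    tokens: List[str] = []
    confidences: List[float] = []
    for _ in range(max_new_tokens):
        token, prob = step(list(prompt) + tokens)
        if token == eos:
            break
        tokens.append(token)
        confidences.append(prob)
        if stop_rule(confidences):
            break  # early exit: the remaining decoding steps are skipped
    return tokens


def confident_for(window: int, threshold: float) -> Callable[[List[float]], bool]:
    """Assumed rule: fire when the last `window` tokens all reach `threshold`."""
    def rule(confidences: List[float]) -> bool:
        recent = confidences[-window:]
        return len(recent) == window and min(recent) >= threshold
    return rule


# Toy stand-in for a model: a fixed script of (token, probability) pairs.
SCRIPT = [
    ("the", 0.60), ("answer", 0.95), ("is", 0.97),
    ("B", 0.99), ("because", 0.50), ("<eos>", 1.00),
]


def scripted_step(context: Sequence[str]) -> Tuple[str, float]:
    # The demo prompt has exactly one token, so len(context) - 1 indexes the script.
    return SCRIPT[len(context) - 1]


with_exit = decode_with_early_exit(scripted_step, ["Q:"], confident_for(3, 0.9))
without_exit = decode_with_early_exit(scripted_step, ["Q:"], lambda c: False)
print(with_exit)     # ['the', 'answer', 'is', 'B']
print(without_exit)  # ['the', 'answer', 'is', 'B', 'because']
```

Keeping the stop rule as a separate function means a different confidence criterion can be swapped in without touching the decoding loop.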

Why does it matter?

This work is important because it allows smaller, faster language models to tackle complicated reasoning tasks more effectively. By reducing the time it takes to get an answer, it makes these models more practical for real-world applications where speed is crucial, all while maintaining a good level of accuracy.

Abstract

Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.
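One way to picture a stage-wise, latency-aware budget search is the sketch below. It is an illustration under stated assumptions rather than the paper's algorithm: the `Measurement` record, the per-stage selection rule (fastest candidate within an accuracy tolerance), the stage names, and every number in `PROFILE` are made up. It also treats stages independently, which a real sequential pipeline may not allow. Selecting by measured latency instead of by the number of skipped layers is this sketch's reading of "latency-aware": skipping more layers can be slower overall if it leads to more output tokens.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Measurement:
    skipped_layers: int  # candidate budget, expressed as number of layers skipped
    accuracy: float      # accuracy on a calibration set with this budget
    latency: float       # measured end-to-end latency in seconds


def pick_budget(candidates: List[Measurement], max_drop: float) -> Measurement:
    """Return the fastest candidate whose accuracy is within max_drop of no skipping."""
    baseline = next(m for m in candidates if m.skipped_layers == 0)
    allowed = [m for m in candidates if baseline.accuracy - m.accuracy <= max_drop]
    return min(allowed, key=lambda m: m.latency)


def search_stage_budgets(
    profile: Dict[str, List[Measurement]], max_drop: float
) -> Dict[str, int]:
    """Choose one layer budget per stage from offline measurements."""
    return {
        stage: pick_budget(candidates, max_drop).skipped_layers
        for stage, candidates in profile.items()
    }


# Made-up calibration numbers for three hypothetical stages.
PROFILE = {
    # Tolerates moderate skipping; aggressive skipping costs too much accuracy.
    "stage_1": [Measurement(0, 0.80, 1.00), Measurement(4, 0.79, 0.85), Measurement(8, 0.70, 0.75)],
    # Skipping 8 layers is accurate enough but slower than skipping 4
    # (for example, because the output gets longer).
    "stage_2": [Measurement(0, 0.80, 1.20), Measurement(4, 0.78, 1.00), Measurement(8, 0.77, 1.10)],
    # Highly sensitive stage: any skipping exceeds the tolerance.
    "stage_3": [Measurement(0, 0.80, 0.90), Measurement(4, 0.74, 0.80), Measurement(8, 0.60, 0.70)],
}

budgets = search_stage_budgets(PROFILE, max_drop=0.04)
print(budgets)  # {'stage_1': 4, 'stage_2': 4, 'stage_3': 0}
```

In this toy profile the three stages end up with different budgets, which mirrors the stage-wise variation in skip sensitivity described above.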
