
Reward-Guided Speculative Decoding for Efficient LLM Reasoning

Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong

2025-02-03


Summary

This paper introduces Reward-Guided Speculative Decoding (RSD), a new way to make big AI language models reason more efficiently. It's like having a smart assistant who helps a genius solve problems faster and better.

What's the problem?

Big AI language models are really smart, but they can be slow and use a lot of computer power when trying to solve complex problems, especially in math and reasoning tasks. This makes it hard to use them in real-world situations where we need quick answers or don't have super powerful computers.

What's the solution?

The researchers created RSD, which uses two AI models working together: a smaller, faster draft model and a more powerful target model. The draft model proposes each reasoning step, and a reward model scores those drafts to decide when the powerful model actually needs to step in. This is like having a quick-thinking student work through a problem and only asking the teacher for help when it's really needed. The researchers also worked out a threshold-based rule for balancing speed and accuracy, so the system gives good answers without wasting compute.
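To make that routing idea concrete, here is a minimal Python sketch of the step-level decision RSD makes. The functions draft_step, target_step, and reward below are placeholder stand-ins invented for this illustration, not the paper's code or any real model API; in the actual system, a process reward model scores each drafted reasoning step and the large target model is only called when that score falls below a threshold.

```python
# Minimal sketch of reward-guided speculative decoding (illustrative only).
# All models and the reward function are dummy stand-ins, not real APIs.

def draft_step(prefix):
    # Hypothetical small/fast draft model: proposes the next reasoning step.
    return prefix + ["draft-step"]

def target_step(prefix):
    # Hypothetical large/powerful target model: produces a higher-quality step.
    return prefix + ["target-step"]

def reward(prefix, step):
    # Hypothetical process reward model: scores the drafted step in [0, 1].
    return 0.9 if len(prefix) % 2 == 1 else 0.3

def rsd_generate(prompt, threshold=0.7, max_steps=5):
    """Keep the cheap draft step when its reward clears the threshold;
    otherwise fall back to the expensive target model for that step."""
    output = [prompt]
    for _ in range(max_steps):
        candidate = draft_step(output)
        if reward(output, candidate[-1]) >= threshold:
            output = candidate              # accept the draft's step
        else:
            output = target_step(output)    # invoke the target model instead
    return output

if __name__ == "__main__":
    print(rsd_generate("Solve: 2 + 2 = ?"))
```

The key design choice, per the abstract, is that unlike standard speculative decoding, which insists the final output be exactly what the target model would have produced, RSD deliberately keeps biased-but-high-reward draft steps; that is where most of the compute savings come from.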

Why it matters?

This matters because it makes powerful AI language models much more practical to use in the real world. RSD can solve complex problems, even Olympiad-level math questions, using up to 4.4 times less compute while being more accurate on average than parallel decoding baselines. This could help make advanced AI available to more people and businesses, even if they don't have access to the most powerful computers. It's a big step toward smart AI systems that can reason efficiently, which could be useful in all sorts of applications, from education to scientific research.

Abstract

We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains over decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.