DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu
2025-03-04
Summary
This paper introduces DuoDecoding, a new method that makes large AI language models generate text faster without losing quality. The key idea is to split the work between a computer's CPU and GPU so that guessing and checking words can happen at the same time.
What's the problem?
Large language models are good at many tasks, but they are slow because they generate text one word at a time. Speculative decoding speeds this up by having a small "draft" model guess several upcoming words for the large model to check in one go. However, running the draft model adds its own overhead: it delays the first word of output, and cheaper drafting shortcuts tend to produce worse guesses that the large model rejects.
What's the solution?
The researchers created DuoDecoding, which combines two ideas. First, it runs the small draft model on the CPU while the large target model runs on the GPU, so one can guess the next words while the other checks the previous ones in parallel. Second, it dynamically decides how many words to guess at once, based on how confident the draft model is, and drafts multiple candidate sequences when the model is unsure. Together, these keep both processors busy and keep the guesses accurate.
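The confidence-based drafting loop can be illustrated with a minimal, self-contained sketch. Everything here is a toy stand-in, not the paper's implementation: `true_token` fakes the target model's greedy choice, `draft_step` fakes a draft model that is wrong at every 7th position, and the confidence threshold and draft budget are made-up values. The real system also drafts multiple candidate sequences and runs the two models on CPU and GPU in parallel, which this sequential sketch omits.

```python
def true_token(pos):
    # Toy ground truth: what the (imagined) target model would emit at pos.
    return chr(ord('a') + pos % 26)

def draft_step(context):
    # Toy draft model: returns (token, confidence); wrong and unsure at
    # every 7th position, confident and correct elsewhere.
    pos = len(context)
    if pos % 7 == 0:
        return 'x', 0.3
    return true_token(pos), 0.9

def target_verify(context, draft):
    # Toy target model: accepts the longest correct prefix of the draft,
    # then appends one token of its own (a correction, or a bonus token).
    pos, accepted = len(context), []
    for tok in draft:
        if tok != true_token(pos):
            accepted.append(true_token(pos))  # corrected token replaces the rest
            return accepted
        accepted.append(tok)
        pos += 1
    accepted.append(true_token(pos))          # bonus token on full acceptance
    return accepted

def speculative_generate(prompt, max_new_tokens, max_draft=4, conf_threshold=0.7):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # Draft phase: keep guessing while the draft model is confident,
        # capped by a fixed budget (the paper tunes this budget to the hardware).
        draft = []
        while len(draft) < max_draft:
            tok, conf = draft_step(tokens + draft)
            draft.append(tok)
            if conf < conf_threshold:
                break
        # Verify phase: one target pass checks all drafted tokens at once.
        tokens.extend(target_verify(tokens, draft))
    return ''.join(tokens[len(prompt):][:max_new_tokens])
```

When the draft model is confident, several tokens are accepted per target pass; when it hesitates, the draft is cut short so the target model wastes less work rejecting bad guesses.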
Why does it matter?
This matters because it can make AI language models much faster in real-world applications. The researchers found that DuoDecoding generates text up to 2.61 times faster than standard word-by-word generation, and it also starts producing its first word sooner than conventional speculative decoding. This could make AI writing tools, chatbots, and other language applications noticeably more responsive in everyday use.
Abstract
Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. Nevertheless, the draft model introduces additional computational overhead, becoming a performance bottleneck and increasing the time to first token (TTFT). Previous approaches to mitigate draft model overhead have primarily relied on heuristics and generally failed to match the quality of the draft language models. To address these challenges, we propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively, enabling parallel decoding while preserving draft quality. Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality. Extensive experiments across seven tasks show that DuoDecoding achieves up to 2.61x speedup in generation latency, while reducing TTFT to 83% of that in conventional speculative decoding. The code is available at https://github.com/KaiLv69/DuoDecoding.
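The CPU/GPU overlap described in the abstract can be sketched with two threads standing in for the two devices. This is an illustrative toy, not the paper's system: `draft_batch` and `verify_batch` are fake deterministic models, and the pipeline simply drafts the next batch optimistically (assuming full acceptance) while the current batch is being verified, discarding that speculative work whenever verification finds a mismatch.

```python
from concurrent.futures import ThreadPoolExecutor

def true_token(pos):
    # Toy ground truth: the target model's greedy choice at position pos.
    return chr(ord('a') + pos % 26)

def draft_batch(context, k=4):
    # Toy draft model ("CPU side"): guesses k tokens, wrong at every 9th position.
    return ['x' if (len(context) + i) % 9 == 0 else true_token(len(context) + i)
            for i in range(k)]

def verify_batch(context, draft):
    # Toy target model ("GPU side"): returns (accepted prefix, correction or None).
    pos, accepted = len(context), []
    for tok in draft:
        if tok != true_token(pos):
            return accepted, true_token(pos)
        accepted.append(tok)
        pos += 1
    return accepted, None

def pipelined_generate(n_tokens):
    tokens = []
    draft = draft_batch(tokens)
    with ThreadPoolExecutor(max_workers=2) as pool:
        while len(tokens) < n_tokens:
            # "GPU" verifies the current draft while the "CPU" optimistically
            # drafts the next batch as if the current one were fully accepted.
            fut_verify = pool.submit(verify_batch, list(tokens), list(draft))
            fut_next = pool.submit(draft_batch, tokens + draft)
            accepted, correction = fut_verify.result()
            next_draft = fut_next.result()
            tokens.extend(accepted)
            if correction is None:
                draft = next_draft             # full acceptance: overlap paid off
            else:
                tokens.append(correction)      # target's corrected token
                draft = draft_batch(tokens)    # stale speculative draft discarded
    return ''.join(tokens[:n_tokens])
```

Because the draft and verify calls run concurrently, the drafter's latency is hidden behind verification whenever its guesses are accepted, which is the intuition behind placing the two models on separate devices.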