Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE
Haiduo Huang, Fuwei Yang, Zhenhua Liu, Yixing Xu, Jinze Li, Yang Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum
2025-02-11
Summary
This paper introduces Jakiro, a method that makes large language models generate text faster without losing quality. It improves a technique called speculative decoding by using a set of independent experts to produce more diverse and accurate draft predictions.
What's the problem?
Current speculative decoding methods use a smaller, faster model to guess multiple tokens at once, but the candidate guesses at each step are all derived from the same internal representation, so they tend to be very similar. This lack of diversity limits prediction accuracy and caps the overall speedup.
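To make the draft-then-verify loop concrete, here is a minimal sketch of plain speculative decoding with toy deterministic stand-ins for the draft and target models (`target_next` and `draft_next` are invented for illustration; real systems use a small and a large LM):

```python
def target_next(seq):
    # Toy "target model": a deterministic next-token rule standing in
    # for the large LM's greedy decode over integer token ids.
    return (sum(seq) * 31 + 7) % 100

def draft_next(seq):
    # Toy "draft model": agrees with the target most of the time,
    # standing in for a small, cheaper LM that is usually right.
    t = target_next(seq)
    return t if t % 5 != 0 else (t + 1) % 100  # diverges occasionally

def speculative_step(seq, k=4):
    # 1) The draft model autoregressively proposes k tokens.
    draft, cur = [], list(seq)
    for _ in range(k):
        tok = draft_next(cur)
        draft.append(tok)
        cur.append(tok)
    # 2) The target model verifies all k positions (in one parallel
    #    pass in practice; simulated here with k calls) and keeps the
    #    longest prefix matching its own greedy choice.
    accepted, cur = [], list(seq)
    for tok in draft:
        t = target_next(cur)
        if tok == t:
            accepted.append(tok)
            cur.append(tok)
        else:
            accepted.append(t)  # target's token replaces the mismatch
            break
    else:
        accepted.append(target_next(cur))  # bonus token if all accepted
    return seq + accepted
```

Because every kept token equals the target's own greedy choice, the output is identical to plain target-only decoding; the speedup comes from accepting several draft tokens per expensive target pass.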
What's the solution?
The researchers created Jakiro, which uses a technique called Mixture of Experts (MoE). Independent 'expert' modules each make their own prediction, decoupling the candidates from one another and producing more varied, more accurate guesses. They combine this with a hybrid strategy that uses autoregressive decoding for the first draft token and parallel decoding for the later ones, and add a contrastive mechanism on features to further improve accuracy.
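The contrast between correlated and decoupled candidates can be sketched as follows. This is a toy illustration of the idea, not Jakiro's actual architecture: the weights (`shared_head`, `experts`) are random stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, NUM_EXPERTS = 50, 16, 4

# Hypothetical random weights standing in for trained projections.
shared_head = rng.normal(size=(HIDDEN, VOCAB))
experts = [rng.normal(size=(HIDDEN, VOCAB)) for _ in range(NUM_EXPERTS)]

def coupled_candidates(h, k=4):
    # Baseline: top-k tokens from ONE head over ONE hidden state.
    # All candidates come from the same logits, so they are correlated.
    logits = h @ shared_head
    return [int(i) for i in np.argsort(logits)[-k:][::-1]]

def decoupled_candidates(h):
    # MoE-style drafting: each independent expert projects the hidden
    # state through its own weights and contributes its own argmax
    # candidate, decorrelating the candidate set.
    return [int(np.argmax(h @ W)) for W in experts]
```

In the baseline, the k candidates are just the k largest entries of a single distribution; in the MoE variant, each candidate comes from a different expert's view of the same state, which is the decoupling the paper exploits.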
Why does it matter?
This matters because it makes large AI language models work faster and more accurately when generating text. This could lead to quicker and more reliable AI assistants, chatbots, and other text-generating tools, making them more useful in real-world applications where speed and accuracy are important.
Abstract
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation in this approach: the candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy, combining autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and enhance the latter with a contrastive mechanism on features to improve accuracy. Our method significantly boosts prediction accuracy and achieves higher inference speedups. Extensive experiments across diverse models validate the effectiveness and robustness of our approach, establishing a new SOTA in speculative decoding. Our codes are available at https://github.com/haiduo/Jakiro.
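The hybrid strategy in the abstract (autoregressive for the initial draft token, parallel for the rest) can be sketched as below. All names and weights here (`W_ar`, `W_par`, `embed`) are invented stand-ins, assuming one parallel head per later draft position; Jakiro's real components differ.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, HIDDEN = 50, 16

# Hypothetical trained components: one autoregressive head for the
# first draft token, independent heads for the later positions.
W_ar = rng.normal(size=(HIDDEN, VOCAB))
W_par = [rng.normal(size=(HIDDEN, VOCAB)) for _ in range(3)]

def embed(token):
    # Toy embedding: a deterministic vector per token id.
    g = np.random.default_rng(token)
    return g.normal(size=HIDDEN)

def hybrid_draft(h):
    # Stage 1: decode the first draft token autoregressively; it is
    # the most consequential, so it gets a full sequential step.
    t0 = int(np.argmax(h @ W_ar))
    # Stage 2: fold that token back into the state, then predict the
    # remaining draft positions in parallel, one head per position.
    h2 = h + embed(t0)
    rest = [int(np.argmax(h2 @ W)) for W in W_par]
    return [t0] + rest
```

The design trade-off is that the sequential first step keeps the most error-prone position accurate, while the parallel later steps keep drafting cheap.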