FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun
2025-03-05
Summary
This paper introduces FR-Spec, a method for making large language models generate text faster, especially models with very large vocabularies.
What's the problem?
Current methods for speeding up AI text generation lose much of their benefit on models with very large vocabularies, such as Llama-3's 128,000-token vocabulary. This slows down text generation, which is a significant obstacle to deploying these models in real-world applications.
What's the solution?
The researchers created FR-Spec, which speeds things up with a simple observation: a small fraction of the vocabulary accounts for most generated tokens. Instead of scoring every possible token at each draft step, FR-Spec restricts the draft model's search to the most frequently used tokens. This cuts the time spent in the language-modeling (LM) head by 75% without changing the final output, making generation about 1.12 times faster than the previous best method.
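The core idea can be sketched in a few lines: rank token ids by corpus frequency, keep only the top fraction, and have the draft model compute logits against that slice of the LM-head weight matrix instead of the full one. This is a minimal illustrative sketch with synthetic frequencies and random weights, not the paper's implementation; the function names and sizes here are hypothetical.

```python
import numpy as np

def build_freq_ranked_subset(token_freqs, k):
    """Return the ids of the k most frequent tokens (hypothetical helper)."""
    return np.argsort(token_freqs)[::-1][:k]

def draft_logits(hidden, lm_head_weight, subset_ids):
    """Score only the frequency-ranked subset: a (k, d) slice of the
    full (V, d) LM-head weight, so the matmul cost shrinks by V/k."""
    return hidden @ lm_head_weight[subset_ids].T

# Toy sizes: keeping 25% of the vocabulary removes ~75% of LM-head compute.
V, d, k = 1000, 16, 250
rng = np.random.default_rng(0)
token_freqs = rng.random(V)               # stand-in for real corpus counts
lm_head_weight = rng.standard_normal((V, d))
hidden = rng.standard_normal(d)           # draft model's hidden state

subset_ids = build_freq_ranked_subset(token_freqs, k)
logits = draft_logits(hidden, lm_head_weight, subset_ids)
# Map the best subset position back to a full-vocabulary token id.
draft_token = subset_ids[np.argmax(logits)]
```

Because the subset logits are an exact slice of the full logits, any token the draft proposes is scored identically to how the full LM head would score it; only tokens outside the subset can never be drafted, and those are caught by the verification step.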
Why it matters?
This matters because it could make AI language models much more practical to use in everyday situations. Faster text generation means AI could respond more quickly in conversations, write content more efficiently, or process large amounts of text in less time. This could lead to more responsive AI assistants, better automatic translation services, and more efficient content creation tools.
Abstract
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2.
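The "equivalence of the final output distribution" comes from the standard speculative-sampling acceptance rule applied at verification: the target model accepts each drafted token t with probability min(1, p_target(t) / p_draft(t)), which keeps the output exact even though the draft only proposes tokens from a restricted subset. A minimal sketch of that rule, with made-up toy distributions (not from the paper):

```python
import numpy as np

def verify(draft_tokens, target_probs, draft_probs, rng):
    """Standard speculative-sampling acceptance test: accept drafted token t
    with probability min(1, p_target(t) / p_draft(t)); stop at first rejection.
    target_probs/draft_probs are per-step distributions over the vocabulary."""
    accepted = []
    for t, p, q in zip(draft_tokens, target_probs, draft_probs):
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)
        else:
            break
    return accepted

# Toy 3-token vocabulary, two draft steps; the target assigns the drafted
# tokens at least as much probability as the draft, so both are accepted.
rng = np.random.default_rng(0)
draft_tokens = [2, 0]
target_probs = [np.array([0.1, 0.1, 0.8]), np.array([0.9, 0.05, 0.05])]
draft_probs = [np.array([0.2, 0.2, 0.6]), np.array([0.5, 0.25, 0.25])]
accepted = verify(draft_tokens, target_probs, draft_probs, rng)  # [2, 0]
```

On rejection, the full method resamples from a corrected residual distribution of the target model; that step is omitted here for brevity, but it is what guarantees the overall output matches the target model exactly.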