AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao
2025-10-24
Summary
This paper introduces AdaSPEC, a technique for speeding up text generation in large language models, such as those behind chatbots. It focuses on making a smaller, faster 'draft' model work better with a larger, more accurate 'target' model.
What's the problem?
Currently, to build these smaller draft models, researchers use a process called Knowledge Distillation, which trains the draft model to mimic the target model exactly. However, the draft model has limited capacity, and forcing it to copy everything from the larger model isn't the most effective way to improve speed. The goal isn't a perfect copy of the large model; it is for the draft model to make enough predictions that the larger model will accept, maximizing the 'acceptance rate'.
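To make the 'acceptance rate' concrete, here is a minimal sketch of the standard token-level verification rule used in speculative decoding, where a drafted token is accepted with probability min(1, p_target / p_draft). The function and probability values below are illustrative, not taken from the paper.

```python
# Minimal sketch of the standard speculative decoding acceptance rule.
# p_target: target model's probability for the drafted token.
# p_draft: draft model's probability for the same token.
import random

def accept_draft_token(p_target: float, p_draft: float) -> bool:
    """Accept the drafted token with probability min(1, p_target / p_draft)."""
    if p_draft <= 0.0:
        return False  # a zero-probability proposal is never drafted in practice
    return random.random() < min(1.0, p_target / p_draft)

# The target model agrees strongly with the draft: accepted with probability 1.0.
print(accept_draft_token(p_target=0.6, p_draft=0.3))
# The target model disagrees: accepted only with probability 0.2.
print(accept_draft_token(p_target=0.1, p_draft=0.5))
```

The more often the draft's proposals pass this check, the more tokens the system emits per expensive call to the target model, which is why distillation should optimize for acceptance rather than for exact imitation.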
What's the solution?
AdaSPEC solves this by being smarter about *what* the draft model learns. It uses a reference model to identify the tokens the draft model struggles to fit, filters those out, and focuses distillation on the easier tokens. The draft model can then align closely with the target model where its capacity allows, earning a higher acceptance rate, instead of being overwhelmed by knowledge it cannot absorb.
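The following PyTorch sketch illustrates this idea of selective knowledge distillation. It is not the authors' implementation: the specific difficulty score (the KL divergence between the target and a reference model at each position), the `keep_ratio` parameter, and the function name are assumptions made for illustration.

```python
# Illustrative sketch of selective token filtering for knowledge distillation,
# in the spirit of AdaSPEC (not the released implementation).
import torch
import torch.nn.functional as F

def selective_kd_loss(draft_logits, target_logits, reference_logits, keep_ratio=0.8):
    """
    All logits have shape (batch, seq_len, vocab).
    Returns a KL distillation loss averaged over the easiest `keep_ratio` of tokens.
    """
    target_probs = F.softmax(target_logits, dim=-1)
    target_log_probs = torch.log(target_probs + 1e-9)

    # Per-token difficulty: KL(target || reference), i.e. how poorly the
    # reference model fits the target distribution at each position.
    ref_log_probs = F.log_softmax(reference_logits, dim=-1)
    difficulty = (target_probs * (target_log_probs - ref_log_probs)).sum(-1)

    # Keep the easiest `keep_ratio` of tokens; filter out the hardest ones.
    flat = difficulty.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = flat.kthvalue(k).values
    mask = (difficulty <= threshold).float()

    # Standard forward-KL distillation loss, restricted to the kept tokens.
    draft_log_probs = F.log_softmax(draft_logits, dim=-1)
    kl_per_token = (target_probs * (target_log_probs - draft_log_probs)).sum(-1)
    return (kl_per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

Compared with plain distillation, the only change is the mask: the hardest tokens contribute nothing to the loss, so the draft model's limited capacity is spent where acceptance is actually attainable.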
Why it matters?
This is important because it allows for faster text generation from large language models without sacrificing quality. By improving the acceptance rate of the draft model, the overall system can generate text more quickly and efficiently, which is crucial for real-time applications like chatbots and virtual assistants. The improvements shown across different tasks demonstrate its broad applicability.
Abstract
Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Moreover, draft models often struggle to fully assimilate the target model's knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
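As a rough illustration of the acceptance-rate metric the abstract reports, the sketch below uses a simplified greedy proxy: a drafted token counts as accepted if it matches the target model's argmax prediction. The names `draft_model`, `target_model`, and `eval_loader` are placeholders assuming Hugging Face-style model outputs; none of this comes from the released AdaSPEC code.

```python
# Rough, simplified proxy for token acceptance rate using greedy agreement
# between the draft and target models (assumes Hugging Face-style `.logits`).
import torch

@torch.no_grad()
def greedy_acceptance_rate(draft_model, target_model, eval_loader, device="cuda"):
    accepted, total = 0, 0
    for batch in eval_loader:
        input_ids = batch["input_ids"].to(device)            # (batch, seq_len)
        draft_pred = draft_model(input_ids).logits.argmax(-1)   # draft's next-token picks
        target_pred = target_model(input_ids).logits.argmax(-1) # target's next-token picks
        accepted += (draft_pred == target_pred).sum().item()
        total += draft_pred.numel()
    return accepted / total
```

A draft distilled with selective filtering would be compared against a conventionally distilled one by running both through a measurement like this on held-out task data.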