TAPS: Task Aware Proposal Distributions for Speculative Sampling

Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem

2026-03-31

Summary

This paper investigates how well 'speculative decoding' works when the small, fast model that makes the initial guesses is trained on different kinds of data. Speculative decoding speeds up large language models by having a small 'draft' model quickly propose the next several tokens, which a larger, more accurate 'target' model then verifies in parallel.

What's the problem?

When using speculative decoding, the smaller 'draft' model is often trained on a general mix of text. This research asks: does the *type* of data the draft model is trained on significantly impact how well speculative decoding performs on specific tasks? Specifically, does training a draft model on math problems help it with math tasks, or training it on conversational data help it with chat-based tasks?

What's the solution?

Researchers trained several draft models: some on math instruction data, some on conversational data (like chat logs), and some on a mix of both. They then tested how well each draft model performed when paired with a larger target model on various benchmarks, including math problem solving and multi-turn conversations. They also experimented with combining different draft models during decoding, trying both naive averaging of the models' weights and a smarter method that routes each request to whichever draft model is most confident in its predictions.
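Confidence-based routing can be sketched as follows. This is an illustrative toy, not the paper's code: each drafter is a hypothetical function returning a next-token distribution, and "confidence" is taken as the probability of the most likely token, with the router picking the most confident drafter.

```python
def top_confidence(dist):
    """Confidence = probability of the most likely next token."""
    return max(dist.values())

def route(prefix, drafters):
    """Pick the drafter most confident on this prefix.

    `drafters` maps a name to a function returning a next-token
    probability distribution (a dict: token -> probability).
    """
    scores = {name: fn(prefix) for name, fn in drafters.items()}
    return max(scores, key=lambda name: top_confidence(scores[name]))

# Toy heuristics standing in for specialized drafters: the math drafter
# is confident on arithmetic-looking prefixes, the chat drafter elsewhere.
math_draft = lambda p: {"4": 0.9, "5": 0.1} if "2+2=" in p else {"4": 0.3, "the": 0.7}
chat_draft = lambda p: {"hello": 0.8, "4": 0.2} if "Hi" in p else {"4": 0.5, "the": 0.5}

drafters = {"math": math_draft, "chat": chat_draft}
print(route("2+2=", drafters))    # routes to the math drafter
print(route("Hi there", drafters))  # routes to the chat drafter
```

The appeal of routing at inference time is that each specialized drafter keeps its full task-specific behavior, whereas averaging weights blends the specialists into a model that is strong at neither task.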

Why it matters?

The findings show that the training data of the draft model *does* matter a lot. A draft model trained on math excels at math problems, while one trained on conversations is better at chat. Furthermore, naively averaging draft models' weights doesn't work well; instead, combining the specialists at inference time, by routing to the most confident draft model, leads to the best results. This means that to get the most out of speculative decoding, you need to carefully consider what kind of data you train your draft model on, and how you combine multiple draft models across different tasks.

Abstract

Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.