Batch Speculative Decoding Done Right

Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li, Rui Zhang

2025-10-29

Summary

This paper focuses on making large language models (LLMs) run faster, specifically when processing multiple requests at the same time, a process called batching. It improves a technique called speculative decoding, which uses a smaller, quicker model to predict what the larger, more accurate model will say, and then verifies those predictions.

What's the problem?

When you try to speed up LLMs by processing multiple requests together (batching), a problem arises because each request might need a different number of predictions from the smaller, faster model. This creates uneven data shapes, messing up how the LLM understands the order of words and how it remembers previous parts of the text, ultimately leading to incorrect or inconsistent results. Existing attempts to batch speculative decoding often fail to produce the same answers as the standard, slower way of generating text, which is a major issue.
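The "uneven data shapes" issue (the paper's ragged tensor problem) can be made concrete with a small illustration. The padding scheme and helper below are assumptions for demonstration, not the paper's code:

```python
# Toy illustration of the ragged tensor problem: after one verification
# step, sequences in the same batch have accepted different numbers of
# draft tokens, so their lengths diverge and the padded batch must be
# realigned before the next step.

PAD = -1  # hypothetical padding token id

def realign(batch):
    # Left-pad every sequence to the batch's max length and rebuild the
    # per-token position ids. Getting these (and the attention mask /
    # KV-cache bookkeeping they drive) wrong is what corrupts outputs.
    width = max(len(s) for s in batch)
    padded, positions = [], []
    for seq in batch:
        pad = width - len(seq)
        padded.append([PAD] * pad + seq)
        positions.append([0] * pad + list(range(len(seq))))
    return padded, positions

# three sequences that accepted 3, 1, and 2 draft tokens respectively
batch = [[5, 8, 2, 9, 9, 9], [5, 8, 2, 9], [5, 8, 2, 9, 9]]
padded, positions = realign(batch)
for row, pos in zip(padded, positions):
    print(row, pos)
```

Doing this re-padding and position rebuild after every verification step is exactly the overhead the paper measures and then works to avoid.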

What's the solution?

The researchers first characterized exactly what must happen for the faster, batched version to give the same answers as the standard version, and built a correctness-first baseline called EQSPEC that meets those requirements. Profiling EQSPEC revealed that realigning the ragged data consumes about 40% of the overhead in batch speculative decoding. They then created EXSPEC, which keeps a sliding pool of in-flight requests and dynamically groups together those at the same stage, so the data stays organized and the costly rearranging is largely avoided.
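The grouping idea can be sketched as follows. This is an assumed simplification, not the released EXSPEC code: it only shows how a pool of sequences at different lengths can be bucketed into same-length batches so that no realignment is needed within a group.

```python
# Rough sketch of same-length grouping from a sliding pool: bucket
# in-flight sequences by their current length, then emit batches from
# the largest buckets first.

from collections import defaultdict

def form_groups(pool, max_batch=8):
    # pool: hypothetical mapping of sequence id -> current length
    buckets = defaultdict(list)
    for seq_id, length in pool.items():
        buckets[length].append(seq_id)
    groups = []
    # serve the most populous lengths first; split oversized buckets
    for length, ids in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
        for i in range(0, len(ids), max_batch):
            groups.append((length, ids[i:i + max_batch]))
    return groups

pool = {0: 12, 1: 12, 2: 15, 3: 12, 4: 15, 5: 9}
print(form_groups(pool, max_batch=2))
```

Because every group contains sequences of one length, its tensors are rectangular by construction, which is the property that lets per-sequence speculative speedups survive batching.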

Why it matters?

This work is important because it makes LLMs significantly faster and more efficient, especially when handling many requests simultaneously. This is crucial for real-world applications where speed and cost matter. The new method achieves up to a 3x throughput improvement (at batch size 8 versus batch size 1) while maintaining 95% output equivalence with standard generation, and it can be added to existing LLM serving stacks without custom kernels or changes to the core code.

Abstract

Speculative decoding speeds up LLM inference by using a small draft model to propose multiple tokens that a target model verifies in parallel. Extending this idea to batches is essential for production serving, but it introduces the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, breaking right-alignment and corrupting position IDs, attention masks, and KV-cache state. We show that several existing batch implementations violate output equivalence: the fundamental requirement that speculative decoding must produce identical token sequences to standard autoregressive generation. These violations occur precisely due to improper handling of the ragged tensor problem. In response, we (1) characterize the synchronization requirements that guarantee correctness, (2) present a correctness-first batch speculative decoding scheme, EQSPEC, that exposes realignment as consuming 40% of overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences and dynamically forms same-length groups, to reduce the realignment overhead while preserving per-sequence speculative speedups. On the SpecBench dataset, across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our approach achieves up to a 3x throughput improvement at batch size 8 compared to batch size 1, with efficient scaling through batch size 8, while maintaining 95% output equivalence. Our method requires no custom kernels and integrates cleanly with existing inference stacks. Our code is available at https://github.com/eBay/spec_dec.