Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference
Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova
2025-10-17
Summary
This paper introduces a new method called Mirror Speculative Decoding (Mirror-SD) to make large language models (LLMs) run much faster without sacrificing accuracy.
What's the problem?
Currently, speeding up LLMs using 'speculative decoding' – where a smaller, faster draft model proposes tokens that the larger, more accurate model then verifies – is limited by how long the draft model takes to generate its proposals. Drafting more tokens per step raises the chance the larger model will accept them, but each extra draft step adds latency, so there is a trade-off between draft speed and acceptance rate (verification keeps the final output correct either way). Existing attempts to fix this either lower acceptance rates or don't scale well to very large models.
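To make the trade-off concrete, here is a minimal sketch of a single draft-then-verify step. The two "models" are toy stand-in functions (my assumption for illustration, not the paper's models): the serial draft loop is the latency cost, and the first mismatch discards all remaining drafted tokens, which is why longer drafts can backfire.

```python
# Toy stand-ins for the models: each maps a context (list of token ids)
# to the next token it would emit greedily. These functions are
# illustrative assumptions, not the paper's actual models.
def draft_model(context):
    return (sum(context) + 1) % 5          # cheap, sometimes wrong

def target_model(context):
    return (sum(context) * 3 + 1) % 5      # slow, treated as ground truth

def speculative_step(context, k):
    """Draft k tokens autoregressively, then let the target verify them.

    Returns the tokens accepted this step. A larger k amortizes more
    target work when the draft is right, but each draft call is serial
    latency, and one mismatch discards the rest of the draft.
    """
    drafted, ctx = [], list(context)
    for _ in range(k):                     # serial draft loop: the latency cost
        tok = draft_model(ctx)
        drafted.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in drafted:                    # target verifies each drafted token
        if target_model(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break                          # first mismatch: discard the rest
    # On rejection (or full acceptance) the target still contributes one
    # token, so every step makes progress and output matches the target.
    accepted.append(target_model(ctx))
    return accepted

print(speculative_step([1, 2], k=4))   # draft rejected at once: target's token only
print(speculative_step([2, 3], k=2))   # one draft token accepted, then a correction
```

Because the accept loop only keeps tokens the target itself would have produced, the output is identical to running the target alone; the speedup comes purely from verifying several tokens per target pass.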
What's the solution?
Mirror-SD tackles this problem by running the two models in parallel. The 'draft' model still proposes continuations for the 'target' model to verify, but the target *simultaneously* speculates correction paths in case the draft is wrong. This turns speculation into two complementary pipelines running at the same time. Mirror-SD also uses a technique called 'speculative streaming', where the draft emits multiple tokens per forward pass instead of one, further cutting draft latency. Finally, the system maps the two pipelines onto different kinds of accelerators (GPUs and NPUs) to exploit cross-device parallelism.
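The parallel-pipelines idea can be sketched with threads standing in for the two accelerators. This is a minimal sketch of the concept only, under my own assumptions (toy model functions, greedy verification, a single precomputed correction token), not the paper's implementation: while the draft rolls out its continuation, the target concurrently computes the token it would emit itself, so a first-token rejection costs no extra serial target call.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins; in the real system the draft and target would run on
# separate accelerators (e.g., NPU and GPU). Names are assumptions.
def draft_rollout(context, k):
    """Draft k tokens autoregressively; returns just the new tokens."""
    ctx = list(context)
    for _ in range(k):
        ctx.append((sum(ctx) + 1) % 5)
    return ctx[len(context):]

def target_next(context):
    return (sum(context) * 3 + 1) % 5

def mirror_step(context, k):
    """One step with the draft rollout and the target's 'correction'
    computed concurrently -- a sketch of the two complementary
    pipelines, not the paper's actual algorithm."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        draft_future = pool.submit(draft_rollout, context, k)
        # In parallel, the target speculates the correction for the
        # first position while the draft is still rolling out.
        corr_future = pool.submit(target_next, context)
        drafted, correction = draft_future.result(), corr_future.result()

    if drafted and drafted[0] != correction:
        return [correction]            # correction path was already computed

    accepted, ctx = [], list(context)
    for tok in drafted:                # verify the rest of the draft
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    accepted.append(target_next(ctx))  # target always contributes one token
    return accepted

print(mirror_step([1, 2], k=3))
print(mirror_step([2, 3], k=2))
```

The design point the sketch illustrates is overlap: in plain speculative decoding the target sits idle during drafting, whereas here its correction work is hidden behind the draft's rollout time.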
Why it matters?
This research matters because it delivers large speedups – 2.8x to 5.8x faster wall-time performance, and a 30% average improvement over the strongest prior method – without losing accuracy. Responses from these powerful AI models arrive much sooner, making them more practical for real-world applications and letting more complex tasks finish in a reasonable timeframe.
Abstract
Speculative decoding accelerates LLM inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size elevates acceptance rates but introduces additional latency overhead, exacerbating the speed-accuracy tradeoff. Prior methods (Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from early-exit signals in parallel with the target model's suffix and explicitly maps computation across heterogeneous accelerators (GPU and NPU) to exploit cross-device parallelism. The draft speculates forward continuations for the target to verify, while the target simultaneously speculates correction paths for the draft, converting speculation into two complementary execution pipelines. To further cut draft latency without weakening acceptance semantics, we add speculative streaming so the draft emits multiple tokens per step. This dual strategy of parallel heterogeneous execution plus multi-token speculative streaming pushes speculative decoding toward its ideal regime of high acceptance with low overhead. On SpecBench with server-scale models from 14B to 66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving 2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative improvement over the strongest baseline, EAGLE3.