DEER: Draft with Diffusion, Verify with Autoregressive Models

Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu

2025-12-18

Summary

This paper focuses on making large language models (LLMs) run faster, specifically when they are used to build agents or solve complex reasoning problems. The core issue is that LLMs generate text one token at a time, so long outputs are slow to produce.

What's the problem?

A common way to speed up LLMs, called 'speculative decoding,' uses a fast 'draft' model to propose text and a more accurate but slower 'verifier' model to check it. However, existing draft models are built like traditional autoregressive LLMs: they predict tokens one at a time, and their predictions drift further from the verifier's as the draft grows longer. As a result, the verifier rejects more and more of each draft, which caps the achievable speedup. The sketch below makes this loop concrete.
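To make the bottleneck concrete, here is a minimal Python sketch of the greedy-acceptance draft-verify loop. This is not the paper's code: `draft_model.next_token` and `target_model.next_tokens` are hypothetical stand-ins for the two models.

```python
# A minimal sketch of speculative decoding (greedy-acceptance variant).
# `draft_model` and `target_model` are hypothetical stand-ins, not a real API.

def speculative_step(target_model, draft_model, prefix, k=8):
    # 1) Drafter proposes k tokens one at a time: sequential, and its
    #    per-token quality drifts from the target's as the draft grows.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_model.next_token(ctx)   # greedy next-token guess
        draft.append(tok)
        ctx.append(tok)

    # 2) Target scores all k positions in a single parallel forward pass;
    #    returns its own greedy choice at each of the k+1 positions.
    target_choices = target_model.next_tokens(prefix, draft)

    # 3) Accept the longest prefix of the draft the target agrees with,
    #    then append the target's own token at the first mismatch.
    accepted = []
    for i, tok in enumerate(draft):
        if tok != target_choices[i]:
            accepted.append(target_choices[i])
            break
        accepted.append(tok)
    else:
        accepted.append(target_choices[k])  # whole draft accepted: bonus token

    return accepted  # at least 1 token per target forward pass
```

Note that step 1 still costs k sequential drafter passes, and the longer the draft, the earlier the mismatch in step 3 tends to occur; these are exactly the two limits DEER targets.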

What's the solution?

The researchers found that a different kind of model, a 'diffusion LLM' (dLLM), makes a much better drafter. Because dLLMs generate text through parallel decoding rather than strict left-to-right prediction, they avoid the step-by-step error accumulation that plagues autoregressive drafters. The resulting system, DEER, uses a dLLM to draft a whole segment in a single parallel step and a regular autoregressive model to verify it. A two-stage training pipeline aligns the dLLM drafter with the verifier, so the drafter can produce longer segments that the verifier accepts more often. The sketch below contrasts this with the standard loop.
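Under the same assumptions as the sketch above (the verifier interface is unchanged; `dllm_drafter.fill_masked` is a hypothetical stand-in for single-step diffusion decoding), a DEER-style step might look like this:

```python
# A hedged sketch of a DEER-style step: the diffusion drafter fills an
# entire masked block in one parallel pass instead of token by token.
# `dllm_drafter.fill_masked` is a hypothetical interface, not the paper's code.

MASK = -1  # placeholder id for a masked position

def deer_step(target_model, dllm_drafter, prefix, k=32):
    # 1) Single-step parallel drafting: predict all k masked positions
    #    at once, so a longer draft no longer costs k sequential passes.
    draft = dllm_drafter.fill_masked(prefix, [MASK] * k)

    # 2) Verification is unchanged: the AR target checks the whole draft
    #    in one forward pass and keeps the prefix it agrees with.
    target_choices = target_model.next_tokens(prefix, draft)
    accepted = []
    for i, tok in enumerate(draft):
        if tok != target_choices[i]:
            accepted.append(target_choices[i])
            break
        accepted.append(tok)
    else:
        accepted.append(target_choices[k])
    return accepted
```

The sequential drafting loop disappears: one parallel fill replaces k drafter passes, and a better-aligned drafter keeps the verifier agreeing for longer, which is what lets DEER report accepted drafts of up to 32 tokens.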

Why it matters?

This research is important because it significantly speeds up LLMs without sacrificing accuracy. DEER is much faster than previous methods, achieving a 5.54x speedup on HumanEval with Qwen3-30B-A3B, compared to 2.41x for the prior EAGLE-3 approach. This means LLMs can serve more real-time applications and complex tasks, making AI agents and reasoning systems more practical and efficient.

Abstract

Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a. drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) the decoding of AR drafters is inherently sequential. Together, these factors limit speedups. In this paper, we show that a diffusion large language model (dLLM) drafter can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafter with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc., will be available at https://czc726.github.io/DEER/