TiDAR: Think in Diffusion, Talk in Autoregression
Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, Pavlo Molchanov
2025-11-13
Summary
This paper introduces a new model architecture called TiDAR that aims to combine the speed of diffusion models with the high quality of autoregressive models for generating text.
What's the problem?
Currently, there's a trade-off between how quickly a language model can generate text and how good that text is. Diffusion models are fast because they can predict many tokens in parallel, but their output is often less coherent and natural-sounding than that of autoregressive models. Autoregressive models excel at quality, but they generate text slowly, one token at a time. Existing attempts to combine the two, such as speculative decoding or left-to-right decoding schedules for diffusion models, sacrifice either speed or quality, and so fail to deliver the full benefits of both approaches.
What's the solution?
TiDAR solves this with a two-step process carried out within a single forward pass. First, it 'thinks' by using diffusion to draft several candidate tokens in parallel. Then, it 'talks' by sampling the final output autoregressively, verifying the drafted tokens to ensure high quality. Specially designed structured attention masks let both steps share one forward pass, exploiting GPU compute that would otherwise sit idle. TiDAR is also built to be serving-friendly: it is a standalone model with low overhead and exact KV cache support, requiring no separate draft model.
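The draft-then-verify control flow described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the helper names (`draft`, `verify`, `accept_drafts`) and the `ToyModel` are hypothetical, and in the real TiDAR the drafting and verification happen inside one forward pass using structured attention masks (causal over committed tokens, bidirectional over the draft block), rather than as two separate calls.

```python
# Toy sketch of a per-step "think (draft in parallel), talk (verify AR)" loop.
# All names here are illustrative assumptions, not TiDAR's actual API.

def accept_drafts(draft, ar_preds):
    """Keep the longest prefix of the parallel draft that the AR head agrees
    with; at the first disagreement, commit the AR head's own token instead,
    so at least one token is produced per step."""
    accepted = []
    for d, a in zip(draft, ar_preds):
        if d == a:
            accepted.append(d)
        else:
            accepted.append(a)  # AR correction at first mismatch
            break
    return accepted

def generate(model, prompt, block_size=4, max_len=16):
    """Alternate drafting a block of tokens and verifying it autoregressively.
    (In the real architecture both happen in a single forward pass.)"""
    seq = list(prompt)
    draft = model.draft(seq, block_size)          # parallel "thinking"
    while len(seq) < max_len:
        ar_preds = model.verify(seq, draft)       # autoregressive "talking"
        seq += accept_drafts(draft, ar_preds)
        draft = model.draft(seq, block_size)      # draft the next block
    return seq[:max_len]

class ToyModel:
    """Stand-in model: 'verify' knows the true continuation; 'draft' guesses
    it but gets every third position wrong, mimicking an imperfect drafter."""
    def __init__(self, target):
        self.target = target

    def verify(self, seq, draft):
        n = len(seq)
        return [self.target[n + i] for i in range(len(draft))]

    def draft(self, seq, k):
        n = len(seq)
        return [self.target[n + i] if (n + i) % 3 else -1 for i in range(k)]
```

Because correct draft tokens are accepted in bulk and mistakes cost only one AR step, each iteration commits several tokens instead of one, which is the source of TiDAR's throughput gain.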
Why it matters?
This research is important because it is the first architecture to close the quality gap between diffusion and autoregressive language models while keeping the speed advantage of parallel generation. TiDAR matches the quality of traditional autoregressive models while generating 4.71 to 5.91 times more tokens per second, making it a significant step towards more efficient and powerful text generation.
Abstract
Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR-level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively, all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and LLaDA in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.