Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation
Yao Teng, Fuyun Wang, Xian Liu, Zhekai Chen, Han Shi, Yu Wang, Zhenguo Li, Weiyang Liu, Difan Zou, Xihui Liu
2025-10-13
Summary
This paper introduces a new technique to speed up how images are created from text descriptions using artificial intelligence, specifically focusing on models that build images piece by piece.
What's the problem?
Current AI models that generate images from text are slow because they create the image one small part at a time, like writing a sentence word by word. Producing a single image can require the AI to run through its calculations thousands of times, which makes the process inefficient and time-consuming.
What's the solution?
The researchers developed a method called Speculative Jacobi-Denoising Decoding, or SJD2. Essentially, they taught the AI to work with slightly 'noisy' versions of the image parts and predict what the clean, final parts should be. This allows the AI to guess multiple parts of the image at the same time, instead of one at a time, and then refine those guesses. It’s like sketching out a rough draft of an image all at once and then cleaning it up, rather than drawing each line perfectly in order.
Why it matters?
This research is important because it makes image generation much faster without sacrificing the quality of the images. Faster image generation means quicker turnaround times for artists, designers, and anyone using AI to create visual content, opening up possibilities for more rapid prototyping and creative exploration.
Abstract
As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens for the next iteration with the denoising trajectory. Experiments show that our method can accelerate generation by reducing model forward passes while maintaining the visual quality of generated images.
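The decoding loop described in the abstract can be sketched in toy form: initialize the sequence with Gaussian noise in embedding space, run one parallel forward pass to predict clean tokens at every position, accept a prefix of positions via a probabilistic criterion, and refine the rest along a denoising trajectory. Everything below is illustrative, not the authors' implementation: the tiny linear "model", the acceptance threshold, and the 50/50 embedding blend are all stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, SEQ_LEN = 16, 8, 6

# Toy stand-in for the fine-tuned autoregressive model: maps (possibly
# noisy) token embeddings to a distribution over clean tokens per position.
W = rng.normal(size=(DIM, VOCAB))
embed_table = rng.normal(size=(VOCAB, DIM))

def predict_clean_tokens(embeddings):
    """Softmax over a linear head; placeholder for the real model."""
    logits = embeddings @ W
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# 1) Initialize the whole draft sequence with Gaussian noise in embedding space.
draft_emb = rng.normal(size=(SEQ_LEN, DIM))
accepted = []
step = 0

while len(accepted) < SEQ_LEN:
    step += 1
    probs = predict_clean_tokens(draft_emb)      # one parallel forward pass
    draft_tokens = probs.argmax(axis=-1)

    # 2) Probabilistic acceptance: keep a prefix of positions whose draft
    #    token is confident enough (toy criterion; always accept >= 1 so
    #    the loop makes progress, as in standard speculative decoding).
    n_accept = 0
    for i, tok in enumerate(draft_tokens):
        if n_accept == 0 or probs[i, tok] > 0.2:
            n_accept += 1
        else:
            break
    accepted.extend(draft_tokens[:n_accept].tolist())

    # 3) Refine unaccepted positions for the next iteration: blend the
    #    predicted clean embeddings with the current noisy draft, i.e.
    #    take one step along a denoising trajectory (blend ratio assumed).
    clean_emb = embed_table[draft_tokens[n_accept:]]
    draft_emb = 0.5 * draft_emb[n_accept:] + 0.5 * clean_emb

print(f"generated {len(accepted)} tokens in {step} forward passes")
```

Because each iteration can accept several tokens at once, the number of forward passes is at most, and typically well below, the sequence length, which is the source of the speedup over strict token-by-token decoding.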