Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
2024-10-07

Summary
This paper presents Speculative Jacobi Decoding (SJD), a training-free method that speeds up text-to-image generation by changing how auto-regressive models decode tokens during inference.
What's the problem?
Current auto-regressive models for text-to-image generation can create high-quality images, but they are slow because they predict the image one token at a time. Generating a single image can require hundreds or even thousands of sequential prediction steps, which is time-consuming and inefficient.
What's the solution?
To solve this problem, the authors developed SJD, a decoding algorithm that lets the model draft and verify multiple tokens per step instead of decoding one at a time. SJD uses a probabilistic criterion to decide which drafted tokens to accept, which preserves the randomness and diversity of sampling-based decoding. As a result, SJD can produce images of comparable quality in far fewer decoding steps, reducing generation latency without any additional training.
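The probabilistic acceptance idea can be sketched as follows. This is a hypothetical illustration in the spirit of speculative sampling, not the paper's exact implementation: each drafted token is accepted with probability min(1, p_curr / p_prev), where p_prev is the distribution under which the token was drafted in the previous iteration and p_curr is the model's current distribution for that position, and tokens after the first rejection are discarded to be re-drafted next iteration.

```python
import numpy as np

def accept_tokens(draft_tokens, p_prev, p_curr, rng):
    """Sketch of a speculative-style probabilistic acceptance rule.

    draft_tokens: list of token ids drafted in the previous iteration.
    p_prev, p_curr: arrays of shape (num_drafts, vocab_size) holding the
    drafting and current model distributions per position (hypothetical
    interface, not the paper's API).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept token i with probability min(1, p_curr / p_prev).
        ratio = p_curr[i, tok] / max(p_prev[i, tok], 1e-12)
        if rng.random() < min(1.0, ratio):
            accepted.append(tok)
        else:
            # First rejection: later drafts depend on this position,
            # so they are discarded and re-drafted next iteration.
            break
    return accepted
```

When the current and drafting distributions agree, every token is accepted and an entire block of tokens is decoded in one model call, which is where the speed-up comes from.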
Why it matters?
This research is important because it shows how we can make image generation from text much more efficient without losing quality. By reducing the time and computational resources needed, SJD could enable faster and more accessible applications of AI in areas like art creation, advertising, and virtual reality, where generating images quickly is crucial.
Abstract
The current large auto-regressive models can generate high-quality, high-resolution images, but these models require hundreds or even thousands of steps of next-token prediction during inference, resulting in substantial time consumption. In existing studies, Jacobi decoding, an iterative parallel decoding algorithm, has been used to accelerate auto-regressive generation and can be executed without training. However, Jacobi decoding relies on a deterministic criterion to determine the convergence of iterations. Thus, it works for greedy decoding but is incompatible with the sampling-based decoding that is crucial for visual quality and diversity in current auto-regressive text-to-image generation. In this paper, we propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation. By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding and allowing the model to generate diverse images. Specifically, SJD facilitates the model to predict multiple tokens at each step and accepts tokens based on the probabilistic criterion, enabling the model to generate images with fewer steps than the conventional next-token-prediction paradigm. We also investigate token initialization strategies that leverage the spatial locality of visual data to further improve the acceleration ratio under specific scenarios. We conduct experiments with our proposed SJD on multiple auto-regressive text-to-image generation models, demonstrating effective acceleration without sacrificing visual quality.
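As a rough illustration of the spatial-locality initialization mentioned in the abstract, one possible strategy seeds each draft position with the token from the same column of the previous image row. The function name, the raster-scan token order, and the fallback behavior below are our own assumptions for the sketch, not the paper's exact strategy:

```python
def init_draft_tokens(prev_tokens, window, width):
    """Hypothetical spatial-locality initialization for a draft window.

    Assumes raster-scan token order with `width` tokens per image row:
    each draft position is seeded with the token directly above it,
    falling back to repeating the last decoded token when no row above
    exists yet.
    """
    draft = []
    for i in range(window):
        pos = len(prev_tokens) + i       # absolute position of draft i
        above = pos - width              # same column, previous row
        if 0 <= above < len(prev_tokens):
            draft.append(prev_tokens[above])
        else:
            draft.append(prev_tokens[-1] if prev_tokens else 0)
    return draft
```

The intuition is that neighboring image tokens are often similar, so drafts seeded from spatially adjacent positions are more likely to pass the acceptance test than random initializations.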