
Continuous Speculative Decoding for Autoregressive Image Generation

Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang

2024-11-20

Summary

This paper introduces Continuous Speculative Decoding, a technique for accelerating autoregressive image generation models, which build an image one continuous-valued token at a time.

What's the problem?

Autoregressive image generation models create an image by predicting each token in sequence, conditioned on all previous ones, which makes inference slow and computationally expensive. The problem is especially acute for models that work with continuous values rather than discrete tokens, because the acceleration techniques developed for discrete-token language models do not directly apply to them.

What's the solution?

The authors propose Continuous Speculative Decoding, which adapts speculative decoding, an acceleration technique from large language models, to continuous image data: a fast draft model proposes tokens, and the slower target model verifies them in parallel. By analyzing the target model's output distribution, they derive an acceptance criterion suited to the diffusion-based distributions these models produce, and they add denoising trajectory alignment, token pre-filling, and a careful acceptance-rejection sampling step for handling rejected tokens. Their experiments show that this approach speeds up image generation by 2.33 times while preserving output quality.
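The core verification step generalizes naturally from discrete tokens to continuous densities: a token x drawn from the draft density q is accepted with probability min(1, p(x)/q(x)), where p is the target density. Below is a minimal illustrative sketch of that step, using one-dimensional Gaussians as a stand-in for the diffusion-head distributions the paper actually works with (the function and parameter names here are hypothetical, not from the paper's code):

```python
import numpy as np

def continuous_speculative_step(draft_mu, draft_sigma,
                                target_mu, target_sigma, rng):
    """One speculative step in continuous space: draw x from the draft
    density q, then accept with probability min(1, p(x)/q(x)), where p
    is the target density. Gaussians are an illustrative assumption."""
    x = rng.normal(draft_mu, draft_sigma)

    def logpdf(v, mu, sigma):
        # log density of N(mu, sigma^2)
        return -0.5 * ((v - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

    # accept iff u < p(x)/q(x), computed in log space for stability
    log_ratio = logpdf(x, target_mu, target_sigma) - logpdf(x, draft_mu, draft_sigma)
    accept = np.log(rng.uniform()) < min(0.0, log_ratio)
    return x, accept
```

When the draft and target densities coincide, the ratio is 1 and every draft token is accepted, so the scheme costs nothing in that limit; the better the draft model approximates the target, the more tokens survive verification per target-model pass.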

Why it matters?

This research is important because it enhances the efficiency of generating images with autoregressive models, making it possible to create high-quality visuals much faster. This improvement could benefit various applications in fields like computer graphics, video game design, and artificial intelligence, where quick and realistic image generation is crucial.

Abstract

Continuous-valued Autoregressive (AR) image generation models have demonstrated notable superiority over their discrete-token counterparts, showcasing considerable reconstruction quality and higher generation fidelity. However, the computational demands of the autoregressive framework result in significant inference overhead. While speculative decoding has proven effective in accelerating Large Language Models (LLMs), its adaptation to continuous-valued visual autoregressive models remains unexplored. This work generalizes the speculative decoding algorithm from discrete tokens to continuous space. By analyzing the intrinsic properties of the output distribution, we establish a tailored acceptance criterion for the diffusion distributions prevalent in such models. To overcome the inconsistency that occurs in speculative decoding output distributions, we introduce denoising trajectory alignment and token pre-filling methods. Additionally, we identify the hard-to-sample distribution in the rejection phase. To mitigate this issue, we propose a meticulous acceptance-rejection sampling method with a proper upper bound, thereby circumventing complex integration. Experimental results show that our continuous speculative decoding achieves a remarkable 2.33× speed-up on off-the-shelf models while maintaining the output distribution. Codes will be available at https://github.com/MarkXCloud/CSpD
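The "hard-to-sample distribution in the rejection phase" is the residual density r(x) ∝ max(0, p(x) − q(x)): normalizing it requires an integral that is intractable for diffusion distributions. One standard acceptance-rejection construction avoids the normalizer entirely by proposing from the target p and using p itself as the envelope, since max(0, p − q) ≤ p. The sketch below illustrates that idea with hypothetical helper names; the paper's exact upper-bound choice may differ:

```python
import numpy as np

def sample_residual(target_sample, target_logpdf, draft_logpdf, rng,
                    max_tries=10000):
    """Draw from r(x) proportional to max(0, p(x) - q(x)) without computing
    its normalizing constant. Proposal: y ~ p; accept with probability
    max(0, 1 - q(y)/p(y)), valid because max(0, p - q) <= p pointwise."""
    for _ in range(max_tries):
        y = target_sample(rng)
        # q(y)/p(y), computed from log densities for numerical stability
        ratio = np.exp(draft_logpdf(y) - target_logpdf(y))
        if rng.uniform() < max(0.0, 1.0 - ratio):
            return y
    raise RuntimeError("residual sampling did not converge")
```

Every accepted sample lies where p exceeds q, exactly the region whose probability mass the rejected draft token failed to cover, and the overall acceptance rate equals the total variation distance between p and q.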