
Wavelets Are All You Need for Autoregressive Image Generation

Wael Mattar, Idan Levy, Nir Sharon, Shai Dekel

2024-07-02

Summary

This paper presents a new method for generating images that combines a technique called wavelet image coding with a specialized language model. The approach generates images more efficiently by encoding the most important visual details first.

What's the problem?

Generating images using traditional methods can be computationally expensive and inefficient, especially when trying to capture both coarse and fine details. Many existing techniques struggle to balance the quality of the generated images with the resources needed to create them, which can limit their practical applications.

What's the solution?

To solve this problem, the authors propose a two-part method. First, they use wavelet image coding, which breaks an image down into different levels of detail and orders the information from the most significant features to the least. This lets the system encode the important visual elements first and defer the finer details. Second, they introduce a modified transformer model designed to work with these 'wavelet tokens.' The transformer learns the statistical relationships between different parts of the token sequence, making it better at generating high-quality visuals from the structured information the wavelets provide. A rough sketch of the tokenization idea is shown below.
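The following is a minimal Python sketch of the wavelet-tokenization idea, assuming the PyWavelets (pywt) and NumPy libraries; the function wavelet_tokens and the token format used here are illustrative placeholders rather than the paper's exact scheme.

import numpy as np
import pywt

def wavelet_tokens(image, levels=3, wavelet="haar"):
    # Multi-level 2D decomposition:
    # [cA_n, (cH_n, cV_n, cD_n), ..., (cH_1, cV_1, cD_1)]
    coeffs = pywt.wavedec2(image, wavelet=wavelet, level=levels)
    tokens = []
    # The coarsest approximation comes first: it carries the most significant structure.
    for idx, value in sorted(np.ndenumerate(coeffs[0]), key=lambda kv: -abs(kv[1])):
        tokens.append(("A", levels, idx, value))
    # Detail subbands follow, ordered from coarse resolution down to fine,
    # and within each subband from largest coefficient magnitude to smallest.
    for depth, details in enumerate(coeffs[1:], start=1):
        level = levels - depth + 1
        for name, band in zip(("H", "V", "D"), details):
            for idx, value in sorted(np.ndenumerate(band), key=lambda kv: -abs(kv[1])):
                tokens.append((name, level, idx, value))
    return tokens

# Example: a toy 8x8 image yields a short coarse-to-fine token stream
# that an autoregressive transformer could model one token at a time.
img = np.arange(64, dtype=float).reshape(8, 8)
print(wavelet_tokens(img, levels=2)[:5])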

Why it matters?

This research is important because it offers a more effective way to generate images that can save computational resources while still producing high-quality results. By using wavelet coding and an optimized transformer model, this approach could lead to advancements in fields like computer graphics, virtual reality, and any area where high-quality image generation is needed.

Abstract

In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which makes it possible to tokenize the visual details of an image from coarse to fine by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this 'wavelet language'. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.
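To make the 'most significant bits of the most significant wavelet coefficients' ordering concrete, here is a hedged Python sketch of embedded bit-plane scanning in the spirit of classical embedded coders such as EZW and SPIHT; the helper bitplane_stream is a hypothetical illustration, not the paper's implementation.

import numpy as np

def bitplane_stream(coeffs):
    # Emit bits from the highest bit plane down, so the most significant bits
    # of the largest-magnitude coefficients appear earliest in the stream.
    mags = np.abs(coeffs).astype(np.int64)
    top_plane = max(int(mags.max()).bit_length() - 1, 0)
    for plane in range(top_plane, -1, -1):
        for idx, mag in np.ndenumerate(mags):
            yield plane, idx, int((mag >> plane) & 1)

# Example: the large coefficient 37 contributes 1-bits in the high planes
# long before the small coefficient 3 contributes anything but zeros.
coeffs = np.array([37, 3, -12, 0])
print(list(bitplane_stream(coeffs))[:8])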