
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Liang Chen, Sinan Tan, Zefan Cai, Weichu Xie, Haozhe Zhao, Yichi Zhang, Junyang Lin, Jinze Bai, Tianyu Liu, Baobao Chang

2024-10-09

Summary

This paper introduces the 2-Dimensional Autoregressive (DnD) Transformer, a new model that improves the quality of AI-generated images by reducing the information lost when images are compressed into discrete codes through vector quantization.

What's the problem?

Current methods for generating images autoregressively first compress each image into a sequence of discrete codes using vector quantization (VQ), and this compression discards fine details, lowering image quality. Traditional models also predict these codes one at a time along a single 1D sequence, which limits how many codes can describe an image and therefore how much detail the model can capture.
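To make the bottleneck concrete, here is a minimal sketch of vector quantization in PyTorch. The codebook size, feature dimension, and the `quantize` helper are illustrative placeholders, not the paper's implementation.

```python
import torch

# A minimal sketch of vector quantization (VQ); sizes and names are
# illustrative, not the paper's actual configuration.

torch.manual_seed(0)
codebook_size, dim = 8192, 256
codebook = torch.randn(codebook_size, dim)   # stands in for a learned table

def quantize(z):
    """Snap each continuous feature vector to its nearest codebook entry."""
    dists = torch.cdist(z, codebook)         # (num_tokens, codebook_size)
    indices = dists.argmin(dim=-1)           # one discrete code per token
    return indices, codebook[indices]

z = torch.randn(1024, dim)                   # e.g. a 32x32 grid of features
indices, z_q = quantize(z)

# The residual z - z_q is simply thrown away: with a single code per
# position, whatever the codebook cannot represent is lost for good.
print("quantization error:", (z - z_q).pow(2).mean().item())
```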

What's the solution?

The authors developed the DnD Transformer, which autoregresses along two directions instead of one: the usual sequence direction across image positions, plus a new depth direction that predicts several codes at each position. This lets the model describe an image with many more codes, and therefore finer detail, without a larger backbone model or a longer sequence. Trained only on images, the DnD Transformer can even generate images that combine rich text and graphics, showing an emerging ability to handle both kinds of information together.
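Below is a minimal sketch of that two-direction decoding order. The GRU stands in for the transformer backbone, decoding is greedy, and the way depth codes are conditioned on each other is an illustrative assumption, not the authors' architecture.

```python
import torch
import torch.nn as nn

# Sketch of 2D autoregression: at every sequence step the backbone emits
# several "depth" codes via separate prediction heads, so the sequence
# length stays the same while depth-times more codes describe the image.

vocab, hidden, depth, seq_len = 8192, 512, 4, 256

backbone = nn.GRU(hidden, hidden, batch_first=True)  # stand-in backbone
heads = nn.ModuleList(nn.Linear(hidden, vocab) for _ in range(depth))
embed = nn.Embedding(vocab, hidden)

codes = torch.zeros(1, seq_len, depth, dtype=torch.long)
x, state = torch.zeros(1, 1, hidden), None           # start-of-image input

with torch.no_grad():
    for t in range(seq_len):                 # 1st direction: sequence order
        h, state = backbone(x, state)
        h_d = h[:, -1]                       # hidden state at this position
        for d in range(depth):               # 2nd direction: depth order
            code = heads[d](h_d).argmax(-1)  # greedy pick for the sketch
            codes[0, t, d] = code
            h_d = h_d + embed(code)          # condition deeper codes on shallower ones
        # Feed this position's codes to the next sequence step.
        x = embed(codes[0, t]).sum(0).view(1, 1, -1)

print(codes.shape)  # (1, seq_len, depth): depth codes per image position
```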

Why it matters?

This research is significant because it improves the detail and fidelity of AI-generated images, which is useful for applications such as graphic design, video games, and educational tools. By raising the quality achievable with autoregressive image generation, the DnD Transformer opens up new possibilities for creating rich visual content.

Abstract

This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction. Compared to traditional 1D autoregression and previous work utilizing similar 2D image decomposition such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer's potential extends beyond generating natural images. It can even generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities. This has not been previously demonstrated for popular vision generative models such as diffusion models, showing a spark of vision-language intelligence when trained solely on images. Code, datasets and models are open at https://github.com/chenllliang/DnD-Transformer.
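For readers curious about the "model depth" direction the abstract mentions, here is a minimal sketch of a depth-wise (residual-style) decomposition in the spirit of the RQ-Transformer line of work the paper compares against. The toy low-dimensional random codebook is an illustrative assumption chosen so the effect is visible without any training.

```python
import torch

# Each additional depth code quantizes whatever the previous codes missed,
# so more codes per position means less information loss. Sizes are toy.

torch.manual_seed(0)
dim, codebook_size = 4, 4096
codebook = torch.randn(codebook_size, dim)

def quantize_depth(z, depth):
    """Decompose each feature vector into `depth` codes, coarse to fine."""
    residual = z.clone()
    z_q = torch.zeros_like(z)
    for _ in range(depth):
        idx = torch.cdist(residual, codebook).argmin(-1)  # nearest entry
        q = codebook[idx]
        z_q = z_q + q                 # running reconstruction
        residual = residual - q       # the part still unexplained
    return z_q

z = torch.randn(1024, dim)
for d in (1, 2, 4):
    err = (z - quantize_depth(z, d)).pow(2).mean().item()
    print(f"depth {d}: mean squared error {err:.4f}")  # shrinks with depth
```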