Scalable Autoregressive Image Generation with Mamba
Haopeng Li, Jinyue Yang, Kexin Wang, Xuerui Qiu, Yuhong Chou, Xin Li, Guoqi Li
2024-08-23

Summary
This paper introduces AiM, a new autoregressive image generation model that swaps the usual Transformer backbone for the Mamba state-space architecture to create high-quality images efficiently.
What's the problem?
Generating images with existing autoregressive models can be slow and memory-hungry, because the Transformers they typically rely on scale poorly as the sequence of image tokens grows. This makes it challenging to produce high-quality images quickly and efficiently.
What's the solution?
The authors developed AiM, which replaces the usual Transformer backbone with Mamba, a state-space model that handles long sequences in linear time. Rather than reworking Mamba to process two-dimensional images, AiM keeps the standard next-token prediction recipe: the image is represented as a sequence of discrete tokens that the model generates one at a time (sketched below). AiM comes in sizes ranging from 148 million to 1.3 billion parameters, and its largest model reaches an FID of 2.21 on the ImageNet1K benchmark while running 2 to 10 times faster at inference than comparable diffusion models.
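To make the next-token idea concrete, here is a minimal, hypothetical sketch of the sampling loop such a model runs. The model interface and class_token conditioning below are illustrative assumptions, not AiM's actual API; in practice the discrete tokens come from a VQ-style image tokenizer and are decoded back to pixels after sampling.

    import torch

    @torch.no_grad()
    def sample_image_tokens(model, class_token, seq_len=256, temperature=1.0):
        # Autoregressive next-token sampling over a flattened grid of image
        # tokens (e.g., a 16x16 grid of VQ codes for a 256x256 image).
        # `model` stands in for any causal sequence model (Transformer or
        # Mamba) that maps a token prefix to logits over the codebook.
        tokens = class_token  # shape (B, 1): class-conditioning token (assumed)
        for _ in range(seq_len):
            logits = model(tokens)[:, -1, :] / temperature  # logits at the last position
            probs = torch.softmax(logits, dim=-1)
            next_tok = torch.multinomial(probs, num_samples=1)  # sample one code
            tokens = torch.cat([tokens, next_tok], dim=1)       # append and continue
        return tokens[:, 1:]  # drop the conditioning token; feed to the VQ decoder

The loop itself is architecture-agnostic; the point of AiM is that with Mamba each step updates a fixed-size recurrent state instead of attending over the whole growing prefix.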
Why it matters?
This research is important because it improves the speed and quality of image generation, making it easier for developers and artists to create visuals for various applications. By enhancing how AI can generate images, it opens up new possibilities in fields like gaming, film, and digital art.
Abstract
We introduce AiM, an autoregressive (AR) image generative model based on the Mamba architecture. AiM employs Mamba, a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time complexity, to supplant the commonly utilized Transformers in AR image generation models, aiming to achieve both superior generation quality and enhanced inference speed. Unlike existing methods that adapt Mamba to handle two-dimensional signals via multi-directional scan, AiM directly utilizes the next-token prediction paradigm for autoregressive image generation. This approach circumvents the need for extensive modifications to enable Mamba to learn 2D spatial representations. By implementing straightforward yet strategically targeted modifications for visual generative tasks, we preserve Mamba's core structure, fully exploiting its efficient long-sequence modeling capabilities and scalability. We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256×256 benchmark, our best AiM model achieves an FID of 2.21, surpassing all existing AR models of comparable parameter counts and demonstrating significant competitiveness against diffusion models, with 2 to 10 times faster inference speed. Code is available at https://github.com/hp-l33/AiM.
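As a rough illustration of why state-space models like Mamba scale linearly with sequence length, the sketch below implements the sequential form of a diagonal linear SSM recurrence, h_t = A*h_{t-1} + B*x_t, y_t = C*h_t. This is a deliberate simplification of Mamba, which makes these parameters input-dependent ("selective") and evaluates the recurrence with a hardware-aware parallel scan, but the O(L) cost in sequence length L is the same.

    import torch

    def ssm_scan(x, A, B, C):
        # Reference recurrence of a diagonal linear state-space layer:
        #   h_t = A * h_{t-1} + B * x_t,   y_t = C * h_t
        # One state update per step gives O(L) total cost, versus the
        # O(L^2) pairwise interactions of self-attention.
        batch, length, dim = x.shape
        h = torch.zeros(batch, dim)
        ys = []
        for t in range(length):
            h = A * h + B * x[:, t]  # per-channel (diagonal) state update
            ys.append(C * h)         # read the output from the state
        return torch.stack(ys, dim=1)

    # Toy usage: a batch of 2 sequences, 1024 steps, 64 channels.
    x = torch.randn(2, 1024, 64)
    A, B, C = torch.full((64,), 0.9), torch.ones(64), torch.ones(64)
    y = ssm_scan(x, A, B, C)  # shape (2, 1024, 64)

During generation, this kind of recurrence means each new token costs a constant amount of work and memory, which is what gives AiM its inference-speed advantage over attention-based AR models.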