Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan

2024-09-09

Summary

This paper talks about Open-MAGVIT2, a project that releases open-source auto-regressive image generation models so that anyone can build and experiment with high-quality image generation.

What's the problem?

Generating high-quality images is challenging, especially when the best existing models are closed-source or require a lot of computational resources. Many current models are also limited by the small vocabularies of their visual tokenizers, which caps the quality and detail of the images they can produce.

What's the solution?

Open-MAGVIT2 introduces a series of auto-regressive image generation models ranging in size from 300 million to 1.5 billion parameters. It replicates Google's MAGVIT-v2 tokenizer, whose very large vocabulary of 2^18 (262,144) codes enables more faithful image reconstruction. To help the models predict over such a large vocabulary, the project splits it into two sub-vocabularies of different sizes (asymmetric token factorization, sketched below) and introduces 'next sub-token prediction' to strengthen the interaction between sub-tokens, improving generation quality and detail. All models and code are released to the public to encourage creativity and innovation in visual generation.
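
To make the factorization idea concrete, here is a minimal sketch of splitting one token index from a 2^18-way vocabulary into two smaller sub-token indices. The split sizes (2^6 and 2^12) are illustrative assumptions, not the paper's actual configuration, and the function names are hypothetical:

```python
import torch

# Illustrative, asymmetric split of an 18-bit code index:
# 2^6 * 2^12 = 2^18, so the pair of sub-tokens losslessly
# encodes the original index. The exact split used in the
# paper may differ; these sizes are assumptions.
BITS_A, BITS_B = 6, 12

def factorize(code: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Split a token index in [0, 2^18) into two sub-token indices."""
    sub_a = code >> BITS_B               # high 6 bits -> vocab of 64
    sub_b = code & ((1 << BITS_B) - 1)   # low 12 bits -> vocab of 4096
    return sub_a, sub_b

def defactorize(sub_a: torch.Tensor, sub_b: torch.Tensor) -> torch.Tensor:
    """Recombine the two sub-token indices into the original index."""
    return (sub_a << BITS_B) | sub_b

codes = torch.randint(0, 2**18, (4,))
a, b = factorize(codes)
assert torch.equal(defactorize(a, b), codes)  # round-trip is lossless
```

The point of the asymmetry is that the auto-regressive model never has to produce a single 262,144-way softmax; it predicts two much smaller distributions instead.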

Why it matters?

This research is important because it democratizes access to powerful image generation technology, allowing more people to create and experiment with visual content. By making these tools open-source, it encourages collaboration and innovation in the field of artificial intelligence and creative design.

Abstract

We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., 2^18 codes), and achieves state-of-the-art reconstruction performance (1.17 rFID) on ImageNet 256×256. Furthermore, we explore its application in plain auto-regressive models and validate scalability properties. To assist auto-regressive models in predicting with a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes by asymmetric token factorization, and further introduce "next sub-token prediction" to enhance sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.
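
As a rough illustration of "next sub-token prediction", the sketch below predicts the first sub-token from the transformer's hidden state and then predicts the second sub-token conditioned on the first. The class name, dimensions, and conditioning-by-addition scheme are all assumptions for illustration; the paper's actual mechanism for enhancing sub-token interaction may be implemented differently:

```python
import torch
import torch.nn as nn

# Hypothetical sizes mirroring the factorization sketch above.
V_A, V_B, D = 64, 4096, 512

class NextSubTokenHead(nn.Module):
    """At each sequence position: predict sub-token A, then predict
    sub-token B conditioned on A (teacher-forced during training)."""
    def __init__(self):
        super().__init__()
        self.head_a = nn.Linear(D, V_A)
        self.embed_a = nn.Embedding(V_A, D)
        self.head_b = nn.Linear(D, V_B)

    def forward(self, h: torch.Tensor, sub_a_target: torch.Tensor):
        # h: (batch, seq, D) hidden states from the auto-regressive model
        logits_a = self.head_a(h)
        # Inject the ground-truth first sub-token so the second
        # prediction can depend on it (one simple form of interaction).
        h_b = h + self.embed_a(sub_a_target)
        logits_b = self.head_b(h_b)
        return logits_a, logits_b

head = NextSubTokenHead()
h = torch.randn(2, 16, D)
sub_a = torch.randint(0, V_A, (2, 16))
logits_a, logits_b = head(h, sub_a)  # (2, 16, 64) and (2, 16, 4096)
```

At inference time, sub-token A would be sampled first and fed into the second head in place of the ground-truth target.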