Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian

2024-10-18

Summary

This paper introduces Fluid, a new autoregressive model that improves text-to-image generation by using continuous tokens and a random generation order, resulting in higher-quality visuals.

What's the problem?

When it comes to generating images from text descriptions, autoregressive models have not benefited from scale the way large language models have: making these models bigger doesn't reliably make their images better. The paper examines two design choices that may explain this. The first is whether the model uses discrete tokens (each image patch is snapped to the nearest entry in a fixed vocabulary, which throws away detail) or continuous tokens (each patch is kept as a real-valued vector, which preserves more information). The second is the order in which tokens are generated: a fixed raster order (left to right, top to bottom) or a random order. Both choices can significantly affect the quality of the generated images; the sketch below illustrates the difference between the two token types.
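
To make the token distinction concrete, here is a minimal, illustrative Python sketch. The codebook size, dimensions, and names are hypothetical assumptions for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical VQ-style codebook: 1024 fixed entries, each a 16-dim vector.
codebook = rng.normal(size=(1024, 16))

# A continuous latent for one image patch, as a tokenizer might produce.
latent = rng.normal(size=(16,))

# Discrete tokenization: snap the latent to its nearest codebook entry.
# The model then only ever sees one of 1024 possible values per patch.
token_id = int(np.argmin(np.linalg.norm(codebook - latent, axis=1)))
discrete_token = codebook[token_id]

# Continuous tokenization: keep the real-valued latent as-is.
continuous_token = latent

# Whatever detail quantization discards is gone for good.
print("information lost to quantization:", np.linalg.norm(latent - discrete_token))
```

The quantization error printed on the last line is exactly the detail a discrete-token model can never recover, which is one intuition for why continuous tokens lead to better visual quality.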

What's the solution?

The authors ran controlled experiments to measure how each of these choices affects image generation. Models using continuous tokens produced much better visuals than those using discrete tokens, and generating tokens in a random order outperformed a fixed raster order, especially on measures of how well images match their prompts. Based on these findings, they built Fluid, an autoregressive model that generates images from continuous tokens in a random order. Scaled to 10.5B parameters, Fluid set a new state of the art, with a zero-shot FID of 6.16 on MS-COCO 30K and a 0.69 overall score on the GenEval benchmark. A sketch of what a random-order generation loop can look like follows this paragraph.
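
The following is a minimal sketch of random-order generation, under assumptions made here for illustration: the grid size, the step count, and the `predict` stand-in are hypothetical, and the real model uses a transformer with a per-token head to sample continuous tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 64, 16  # hypothetical: an 8x8 grid of 16-dim continuous tokens

def predict(tokens, revealed):
    """Stand-in for the model: given the tokens revealed so far, return a
    prediction for every grid position. Random noise here; in the real
    system this is a transformer plus a continuous-token sampling head."""
    return rng.normal(size=(num_tokens, dim))

tokens = np.zeros((num_tokens, dim))       # all positions start unknown
revealed = np.zeros(num_tokens, dtype=bool)

order = rng.permutation(num_tokens)        # random order; a raster-order model
                                           # would use np.arange(num_tokens)

# Reveal the image a few positions at a time, conditioning each step on
# everything generated so far.
for step in np.array_split(order, 8):
    preds = predict(tokens, revealed)
    tokens[step] = preds[step]
    revealed[step] = True
```

One intuition for the result: a random-order, BERT-style model can attend to context on all sides of each newly generated token, rather than only to the patches above and to the left of it, which plausibly helps it respect the prompt's global layout; the paper reports notably better GenEval scores for random-order models.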

Why it matters?

This research is important because it shows that autoregressive image generators can benefit from scale much as language models do, once the right design choices are made. By improving the methods used for image generation, Fluid can help create more realistic and visually appealing images, which could be useful in fields like entertainment, advertising, and education. The findings also encourage further work on closing the scaling gap between vision and language models.

Abstract

Scaling up autoregressive models in vision has not proven as beneficial as in large language models. In this work, we investigate this scaling problem in the context of text-to-image generation, focusing on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed raster order using BERT- or GPT-like transformer architectures. Our empirical results show that, while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends. Models based on continuous tokens achieve significantly better visual quality than those using discrete tokens. Furthermore, the generation order and attention mechanisms significantly affect the GenEval score: random-order models achieve notably better GenEval scores compared to raster-order models. Inspired by these findings, we train Fluid, a random-order autoregressive model on continuous tokens. The Fluid 10.5B model achieves a new state-of-the-art zero-shot FID of 6.16 on MS-COCO 30K, and 0.69 overall score on the GenEval benchmark. We hope our findings and results will encourage future efforts to further bridge the scaling gap between vision and language models.