MonoFormer: One Transformer for Both Diffusion and Autoregression
Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang
2024-09-25

Summary
This paper introduces MonoFormer, a model that handles two different generation processes, autoregression and diffusion, with a single shared transformer. This lets one model generate both text and images instead of relying on a separate backbone for each task.
What's the problem?
Most existing models that generate both text and images either use two separate networks (one for each task) or force images into the text-style pipeline by turning them into discrete tokens. Maintaining two backbones adds cost and complexity, while discretizing images can limit how well the model generates visual content.
What's the solution?
To solve this problem, the researchers developed MonoFormer, which uses one transformer model to handle both text generation (autoregression) and image generation (diffusion). They observed that the training procedures for these two tasks are nearly identical, so the same model can be shared. The key difference lies in the attention mask: autoregression uses a causal mask, where each token attends only to earlier tokens, while diffusion uses a bidirectional mask, where every token attends to every other token. Their experiments showed that MonoFormer can generate high-quality images while retaining good text generation capability, achieving results comparable to the best existing methods.
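The mask switch is the whole trick, and it is easy to express in code. Below is a minimal sketch, assuming a generic PyTorch transformer stack rather than the paper's actual LLM backbone; names such as `shared_transformer`, `d_model`, and `seq_len` are illustrative and not taken from the released code.

```python
# Minimal sketch (not the authors' code): one shared transformer whose only
# mode-dependent ingredient is the attention mask.
import torch
import torch.nn as nn

d_model, n_heads, n_layers, seq_len = 256, 4, 2, 8

# One stack of transformer layers shared by both tasks.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, batch_first=True)
shared_transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

x = torch.randn(1, seq_len, d_model)  # text embeddings or noisy image latents

# Autoregressive (text) mode: causal mask, each position sees only the past.
causal_mask = torch.triu(
    torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
text_hidden = shared_transformer(x, mask=causal_mask)

# Diffusion (image) mode: bidirectional attention, i.e. no mask at all.
image_hidden = shared_transformer(x)

print(text_hidden.shape, image_hidden.shape)  # both: torch.Size([1, 8, 256])
```

The inputs differ between the two modes (discrete text tokens versus continuous noisy image latents), but the only structural change between the two forward passes is which mask is supplied.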
Why it matters?
This research is significant because it simplifies the process of creating AI models that can generate both text and images. By using a single model for both tasks, MonoFormer can potentially save time and resources while improving performance. This advancement could lead to more powerful AI applications in areas like content creation, gaming, and virtual reality, where both text and visuals are essential.
Abstract
Most existing multimodal methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or share one backbone by discretizing the visual data so that autoregression handles both text and visual generation. In this paper, we propose to study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) the transformer has been successfully applied to diffusion for visual generation, and (ii) transformer training for autoregression and diffusion is very similar; the difference merely lies in that diffusion uses a bidirectional attention mask and autoregression uses a causal attention mask. Experimental results show that our approach achieves image generation performance comparable to current state-of-the-art methods while maintaining the text generation capability. The project is publicly available at https://monoformer.github.io/.
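To make the claim that "transformer training for autoregression and diffusion is very similar" concrete, the sketch below pairs a next-token cross-entropy loss with a DDPM-style noise-prediction loss computed from one set of shared hidden states. The projection heads (`vocab_proj`, `noise_proj`), the dimensions, and the equal loss weighting are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch (an assumption, not the released implementation) of how
# the two training objectives can share one backbone's outputs.
import torch
import torch.nn.functional as F

hidden = torch.randn(1, 8, 256)            # output of the shared transformer

# Text branch: project to vocabulary logits, apply next-token cross-entropy.
vocab_proj = torch.nn.Linear(256, 1000)    # hypothetical vocabulary size
logits = vocab_proj(hidden)                # (batch, seq, vocab)
targets = torch.randint(0, 1000, (1, 8))
loss_text = F.cross_entropy(logits[:, :-1].reshape(-1, 1000),
                            targets[:, 1:].reshape(-1))

# Image branch: project to the latent space and regress the added noise,
# as in standard diffusion training.
noise_proj = torch.nn.Linear(256, 16)      # hypothetical latent dimension
predicted_noise = noise_proj(hidden)
true_noise = torch.randn_like(predicted_noise)
loss_image = F.mse_loss(predicted_noise, true_noise)

loss = loss_text + loss_image              # relative weighting is a design choice
```

Both branches consume the same hidden states, which is what allows one transformer to be optimized for both objectives at once.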