ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer
Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun
2024-12-11

Summary
This paper introduces ACDiT, a new model that improves how AI generates visual content, such as images and videos, by combining two complementary approaches: autoregressive modeling and diffusion.
What's the problem?
Current models for generating visual content typically use either autoregressive methods, which predict future content from past data, or diffusion models, which create images through a gradual denoising process. These approaches have different strengths and weaknesses, which makes it hard to build a single unified model that generates high-quality visuals efficiently.
What's the solution?
The authors propose ACDiT (Autoregressive blockwise Conditional Diffusion Transformer), which offers flexibility in how visual information is generated. By adjusting the block size used in the diffusion process, ACDiT can interpolate between fine-grained, token-wise autoregressive prediction and broad, full-sequence diffusion. The model generates images and videos effectively while also being useful for visual understanding, making it adaptable to a variety of tasks.
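To make the block-size idea concrete, the sketch below splits a flat sequence of latent tokens into autoregressive units. This is an illustration only: the helper name split_into_blocks and the flat-list layout are assumptions made for exposition, not the authors' implementation.

```python
# Illustrative sketch, not the authors' code: split a latent-token sequence
# into autoregressive units; each unit would be generated by diffusion while
# conditioning on the previously generated (clean) units.
def split_into_blocks(latent_tokens, block_size):
    return [latent_tokens[i:i + block_size]
            for i in range(0, len(latent_tokens), block_size)]

tokens = list(range(16))
print(len(split_into_blocks(tokens, block_size=1)))   # 16 units: token-wise autoregression
print(len(split_into_blocks(tokens, block_size=16)))  # 1 unit: full-sequence diffusion
print(len(split_into_blocks(tokens, block_size=4)))   # 4 units: the interpolated regime
```

Setting the block size to 1 recovers token-by-token autoregression, while setting it to the full sequence length recovers ordinary full-sequence diffusion; intermediate values give the blockwise regime ACDiT targets.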
Why it matters?
This research is important because it advances AI-based visual generation by combining the best features of different modeling techniques. ACDiT's ability to handle both image generation and visual understanding makes it a promising tool for future applications in art, design, and other areas where visual content is essential.
Abstract
The recent surge of interest in comprehensive multimodal models has necessitated the unification of diverse modalities. However, this unification suffers from disparate methodologies: continuous visual generation relies on full-sequence diffusion-based approaches, despite their divergence from the autoregressive modeling used in the text domain. We posit that autoregressive modeling, i.e., predicting the future based on past deterministic experience, remains crucial in developing both visual generation models and a potential unified multimodal model. In this paper, we explore an interpolation between autoregressive modeling and full-sequence diffusion for modeling visual information. At its core, we present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer, in which the block size of diffusion, i.e., the size of each autoregressive unit, can be flexibly adjusted to interpolate between token-wise autoregression and full-sequence diffusion. ACDiT is easy to implement, as simple as creating a Skip-Causal Attention Mask (SCAM) during training. During inference, the process iterates between diffusion denoising and autoregressive decoding, which can make full use of the KV-Cache. We verify the effectiveness of ACDiT on image and video generation tasks. We also demonstrate that, benefiting from autoregressive modeling, ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective. An analysis of the trade-off between autoregressive modeling and diffusion demonstrates ACDiT's potential for long-horizon visual generation tasks. These strengths make it promising as the backbone of future unified models.
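The Skip-Causal Attention Mask mentioned in the abstract can be illustrated with a rough sketch. The snippet below is not taken from the paper's code: the sequence layout (all clean blocks followed by all noised blocks), the function name skip_causal_mask, and the exact attention rules (a noised block attends to earlier clean blocks and to itself; a clean block attends blockwise-causally to clean blocks up to and including itself) are assumptions about one plausible construction of such a mask.

```python
# Hypothetical sketch of a skip-causal-style attention mask; the layout and
# attention rules are assumptions, not the paper's exact implementation.
import torch

def skip_causal_mask(num_blocks: int, block_size: int) -> torch.Tensor:
    """Boolean mask over [clean blocks | noised blocks]; True means attention is allowed."""
    n = num_blocks * block_size
    mask = torch.zeros(2 * n, 2 * n, dtype=torch.bool)
    for i in range(num_blocks):
        clean_i = slice(i * block_size, (i + 1) * block_size)           # clean block i
        noised_i = slice(n + i * block_size, n + (i + 1) * block_size)  # noised block i
        mask[clean_i, : (i + 1) * block_size] = True  # clean i -> clean 0..i (blockwise causal)
        mask[noised_i, : i * block_size] = True       # noised i -> clean 0..i-1 (AR conditioning)
        mask[noised_i, noised_i] = True               # noised i -> itself (denoising context)
    return mask

print(skip_causal_mask(num_blocks=3, block_size=2).int())
```

Under a layout like this, inference would proceed block by block: the keys and values of already-denoised blocks are cached (the KV-Cache mentioned above), and the next block is produced by running the diffusion denoising loop conditioned on that cache, matching the abstract's description of iterating between diffusion denoising and autoregressive decoding.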