Autoregressive Models in Vision: A Survey
Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong
2024-11-12

Summary
This paper surveys the use of autoregressive models in computer vision, showing how these models can generate high-quality images and videos by predicting visual data in a step-by-step manner.
What's the problem?
Autoregressive models have been highly successful in natural language processing, but applying them to images and videos raises distinct challenges. Visual data can be represented at several levels (pixels, discrete tokens, or multi-resolution scales) and has no single natural ordering, unlike the inherently sequential structure of language. This makes it harder to apply autoregressive models effectively to tasks such as image generation and understanding.
What's the solution?
The authors review the literature on autoregressive models in vision and group them into three main types based on how visual data is represented: pixel-based, token-based, and scale-based models. They also examine how these models relate to other generative approaches and discuss applications in image generation, video generation, 3D generation, and multimodal generation. The survey highlights the strengths and weaknesses of current approaches and suggests future research directions.
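To make the token-based case concrete, below is a minimal sketch of raster-order next-token sampling over a discrete visual codebook. It is illustrative only: the codebook size, grid size, and the next_token_probs stub are assumptions for the sketch, not details taken from the survey.

import numpy as np

# Toy illustration of token-based autoregressive generation (hypothetical names;
# a real system would pair a learned visual tokenizer, e.g. a VQ-VAE, with a
# trained transformer). Here the "model" returns uniform next-token probabilities,
# so the output is random, but the loop has the autoregressive shape:
# each token is drawn conditioned on all previously generated tokens.

CODEBOOK_SIZE = 512   # size of the discrete visual vocabulary (assumed)
GRID = 16             # 16x16 token grid -> a 256-token raster-order sequence
rng = np.random.default_rng(0)

def next_token_probs(prefix):
    """Stand-in for p(x_t | x_<t); a trained model would condition on `prefix`."""
    return np.full(CODEBOOK_SIZE, 1.0 / CODEBOOK_SIZE)

tokens = []
for _ in range(GRID * GRID):
    probs = next_token_probs(tokens)                 # condition on the generated prefix
    tokens.append(int(rng.choice(CODEBOOK_SIZE, p=probs)))

token_grid = np.array(tokens).reshape(GRID, GRID)    # a tokenizer's decoder would map this to pixels
print(token_grid.shape)                              # (16, 16)

Pixel-based and scale-based models follow the same pattern but change what each step predicts: an individual pixel value, or an entire coarser-to-finer scale of the image.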
Why it matters?
This research is important because it helps researchers and developers understand how autoregressive models can be adapted for visual tasks. By providing a comprehensive overview of the field, it paves the way for advancements in generating high-quality images and videos, which can be applied in various industries such as entertainment, virtual reality, and medical imaging.
Abstract
Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary at different levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse backgrounds, we start with preliminaries on sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories: pixel-based, token-based, and scale-based models, according to the representation strategy. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, covering image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging areas such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges for autoregressive models in vision and suggest potential research directions. We have also set up a GitHub repository to organize the papers included in this survey at: https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey.
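As background for the "sequence representation and modeling" preliminaries the abstract mentions, the factorization shared by pixel-based, token-based, and scale-based models is the standard autoregressive decomposition (a textbook formulation, not quoted from the survey). For a visual sequence x = (x_1, ..., x_T), whose elements are pixels, tokens, or scales depending on the representation strategy:

p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t})

Each element is predicted conditioned on everything generated before it, which is the step-by-step behavior described in the summary above.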