Rethinking Training Dynamics in Scale-wise Autoregressive Generation
Gengze Zhou, Chongjian Ge, Hao Tan, Feng Liu, Yicong Hong
2025-12-09
Summary
This paper focuses on improving the quality of images created by a specific type of AI model called scale-wise autoregressive generative models, which build images step by step from coarse, blurry versions to fine, detailed ones. These models are getting very good at producing realistic images, but they still suffer from consistency and quality problems.
What's the problem?
The main problem is something called 'exposure bias'. Imagine learning to draw by always tracing over a teacher's perfect sketch at every stage. When you finally draw on your own, each stage has to build on your own earlier, imperfect work, a situation you never practiced for. That's what happens with these AI models: during training they are always given the correct intermediate images, but when generating new images they have to rely on their own, sometimes imperfect, previous steps. This mismatch leads to errors that accumulate. Also, some stages of the image creation process are much harder for the AI to learn than others, creating an imbalance that hurts overall quality.
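The train-test mismatch can be shown with a toy numerical sketch (all numbers here, such as the per-step error `EPS`, are made up for illustration and do not come from the paper): under teacher forcing, every refinement step starts from a correct input, so only one step's worth of error appears; at inference, each step consumes the model's own output, so errors compound across scales.

```python
# Toy illustration of exposure bias in scale-wise (coarse-to-fine) generation.
# The "model" here is a stand-in: an ideal refinement step would return its
# input unchanged, but ours is off by a small amount EPS every time.

EPS = 0.05          # hypothetical per-step prediction error
NUM_SCALES = 8      # hypothetical number of coarse-to-fine scales

def refine(prev_scale):
    """Pretend model step: refines the previous scale, off by EPS."""
    return prev_scale + EPS

# Teacher forcing (training): every step sees the ground-truth input
# (represented by 0.0 here), so the error never compounds.
teacher_forced_error = abs(refine(0.0) - 0.0)

# Student forcing (inference): the model consumes its own previous output,
# so each scale inherits and adds to the errors of the scales before it.
value = 0.0
for _ in range(NUM_SCALES):
    value = refine(value)
student_forced_error = abs(value - 0.0)

print(round(teacher_forced_error, 2))   # 0.05 (one step of error)
print(round(student_forced_error, 2))   # 0.4  (errors compound across scales)
```

The gap between the two error values is exactly the mismatch the paper attributes to exposure bias: training never shows the model inputs that already contain its own mistakes.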
What's the solution?
The researchers developed a method called Self-Autoregressive Refinement, or SAR, which has two main parts. First, 'Stagger-Scale Rollout' lets the model practice on its own intermediate predictions, so it learns to cope with its own imperfections. Second, a 'Contrastive Student-Forcing Loss' provides extra guidance when the model is working from its own generated parts of the image, helping it stay on track and avoid getting confused. Together, these let the model learn from its own mistakes and balance the difficulty across the different stages of image creation.
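A minimal sketch of how these two pieces could fit together in a training loop. The paper's exact formulations of Stagger-Scale Rollout (SSR) and the Contrastive Student-Forcing Loss (CSFL) are not given in this summary, so everything below is a labeled stand-in: the scalar "scales", the function names, and the hinge-style contrastive term are assumptions for illustration only.

```python
# Hedged, simplified sketch of an SAR-style training step (stand-in code,
# not the paper's implementation). "Scales" are scalars instead of images.

def model_step(context, bias=0.03):
    """Toy model: refines the previous scale, with a small systematic error."""
    return context + bias

def ssr_rollout(gt_scales, start):
    """Stagger-Scale Rollout (stand-in): follow ground truth up to `start`,
    then let the model consume its OWN predictions for the remaining scales,
    exposing it to self-generated context as during inference."""
    preds = []
    context = gt_scales[start]
    for _ in range(start, len(gt_scales) - 1):
        context = model_step(context)   # self-generated context
        preds.append(context)
    return preds

def csfl(self_preds, gt_targets, negatives, margin=0.1):
    """Contrastive Student-Forcing Loss (stand-in): pull self-conditioned
    predictions toward the ground truth (positive term) and push them away
    from mismatched targets (hinge-style negative term)."""
    pos = sum(abs(p, ) if False else abs(p - t) for p, t in zip(self_preds, gt_targets))
    neg = sum(max(0.0, margin - abs(p - n)) for p, n in zip(self_preds, negatives))
    return pos + neg

# Usage with toy ground-truth scales (all zeros, hypothetical):
gt_scales = [0.0, 0.0, 0.0, 0.0, 0.0]
self_preds = ssr_rollout(gt_scales, start=1)
targets = gt_scales[2:]                      # the scales the rollout predicts
negatives = [t + 1.0 for t in targets]       # hypothetical mismatched targets
loss = csfl(self_preds, targets, negatives)
print(round(loss, 2))
```

The key design idea the summary describes is visible here: the rollout deliberately feeds the model its own imperfect outputs during training, and the loss supervises those self-conditioned predictions so training stays stable instead of drifting.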
Why it matters?
This research is important because it offers a way to significantly improve existing image-generating AI models without needing to retrain them from scratch. SAR is efficient, meaning it doesn't require a huge amount of computing power, and it's scalable, meaning it can be applied to larger and more complex models. This makes it a practical and reliable method for enhancing the quality of AI-generated images, which has implications for many fields like art, design, and entertainment.
Abstract
Recent advances in autoregressive (AR) generative models have produced increasingly powerful systems for media synthesis. Among them, next-scale prediction has emerged as a popular paradigm, where models generate images in a coarse-to-fine manner. However, scale-wise AR models suffer from exposure bias, which undermines generation quality. We identify two primary causes of this issue: (1) train-test mismatch, where the model must rely on its own imperfect predictions during inference, and (2) imbalance in scale-wise learning difficulty, where certain scales exhibit disproportionately higher optimization complexity. Through a comprehensive analysis of training dynamics, we propose Self-Autoregressive Refinement (SAR) to address these limitations. SAR introduces a Stagger-Scale Rollout (SSR) mechanism that performs lightweight autoregressive rollouts to expose the model to its own intermediate predictions, thereby aligning train-test patterns, and a complementary Contrastive Student-Forcing Loss (CSFL) that provides adequate supervision for self-generated contexts to ensure stable training. Experimental results show that applying SAR to pretrained AR models consistently improves generation quality with minimal computational overhead. For instance, SAR yields a 5.2% FID reduction on FlexVAR-d16 trained on ImageNet 256 within 10 epochs (5 hours on 32xA100 GPUs). Given its efficiency, scalability, and effectiveness, we expect SAR to serve as a reliable post-training method for visual autoregressive generation.