Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

Xiaoyu Yue, Zidong Wang, Yuqing Wang, Wenlong Zhang, Xihui Liu, Wanli Ouyang, Lei Bai, Luping Zhou

2025-09-19

Summary

This research focuses on improving how well computer programs can 'understand' images when they are built to *create* images. It looks at a specific type of image-generating program called an autoregressive model, which originally came from the field of language processing.

What's the problem?

Autoregressive models, while good at creating things step-by-step like writing sentences, struggle with images because images aren't naturally processed in a simple, sequential order. The paper identifies three main issues: the model focuses too much on small, local details instead of the bigger picture (local and conditional dependence), its sense of what the image means drifts between generation steps (inter-step semantic inconsistency), and it doesn't fully grasp that an object remains the same when it appears in a different position (spatial invariance deficiency). Essentially, the model doesn't 'get' the overall meaning of what it's drawing.

What's the solution?

The researchers developed a new training method called Self-guided Training for AutoRegressive models, or ST-AR. This method adds extra self-supervised tasks during training that push the model to learn more about the image's content *without* relying on pre-trained image understanding models. It's like giving the model practice quizzes to check its understanding as it learns to draw.
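The general recipe of combining the usual next-token prediction loss with an auxiliary self-supervised objective can be sketched in a few lines. This is a minimal NumPy illustration of the idea only: the cosine-alignment objective, the function names, and the `aux_weight` value are assumptions for illustration, not the paper's actual losses.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Standard autoregressive loss: cross-entropy over image-token predictions.
    logits: (seq_len, vocab_size), targets: (seq_len,) integer token ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def alignment_loss(feat_a, feat_b):
    """Hypothetical self-guided objective: encourage the model's features for
    two views of the same image to agree, so it learns view-invariant,
    high-level semantics rather than only local detail."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    return 1.0 - (a * b).sum(axis=1).mean()  # 0 when the features match exactly

def self_guided_training_loss(logits, targets, feat_a, feat_b, aux_weight=0.5):
    """Total loss: generation objective plus weighted self-supervised guidance."""
    return next_token_loss(logits, targets) + aux_weight * alignment_loss(feat_a, feat_b)
```

The key point is that the auxiliary term shapes the model's internal features during training but adds nothing at sampling time, which is consistent with the paper's claim that the sampling strategy stays unchanged.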

Why it matters?

This work is important because it significantly improves the quality of images generated by autoregressive models and, more importantly, makes them better at actually 'understanding' what they are creating. The improvements are substantial, roughly a 42% FID improvement for LlamaGen-L and 49% for LlamaGen-XL, and it achieves this without relying on pre-trained representation models, making it a more efficient approach.

Abstract

Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.