SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL

Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang

2025-04-16

Summary

This paper introduces SimpleAR, a deliberately simple AI system for creating images from text. It uses an autoregressive approach, meaning it builds the image step by step, and it produces high-fidelity pictures even though it is not a huge model.
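To make the step-by-step idea concrete, here is a minimal sketch of autoregressive sampling over discrete image tokens. It is not SimpleAR's code: `toy_next_token_logits` is a hypothetical stand-in for a trained model, and the vocabulary size, token counts, and prompt tokens are made-up values chosen only for illustration.

```python
import math
import random


def toy_next_token_logits(prefix, vocab_size):
    """Hypothetical stand-in for a trained model: one score per candidate token."""
    # Toy rule (an assumption): favor the token that follows the last one.
    last = prefix[-1] if prefix else 0
    return [2.0 if t == (last + 1) % vocab_size else 0.0 for t in range(vocab_size)]


def sample_from_logits(logits, rng, temperature=1.0):
    """Draw one token index from the softmax of the logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate_image_tokens(prompt_tokens, num_image_tokens, vocab_size, seed=0):
    """Append image tokens one at a time, each conditioned on everything so far."""
    rng = random.Random(seed)
    sequence = list(prompt_tokens)
    for _ in range(num_image_tokens):
        logits = toy_next_token_logits(sequence, vocab_size)
        sequence.append(sample_from_logits(logits, rng))
    # Return only the generated part; a real system would decode it to pixels.
    return sequence[len(prompt_tokens):]


tokens = generate_image_tokens(prompt_tokens=[3, 1, 4], num_image_tokens=16, vocab_size=8)
print(tokens)
```

In a real system the toy scoring function would be a transformer conditioned on the text prompt, and a separate image tokenizer would turn the generated tokens back into an image.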

What's the problem?

The problem is that most powerful image generation models are extremely large and complex, which makes them hard to train, expensive to run, and sometimes slow to respond. These big models also don't always create images that match the text prompts well or look visually appealing.

What's the solution?

The researchers built SimpleAR, which uses a smaller model with only 0.5 billion parameters and relies on careful pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL) to reach competitive results. By optimizing both how the model learns and how it generates images, SimpleAR creates images that look good and match the text descriptions closely, while also being faster and more efficient than larger models.

Why does it matter?

This matters because it shows that you don't need a giant, complicated model to get top-quality images from text. SimpleAR makes advanced image generation more accessible, affordable, and practical for more people and companies, opening up new possibilities for creativity and technology.

Abstract

A vanilla autoregressive visual generation framework with 0.5B parameters achieves high-fidelity image generation, competitive results on text-to-image benchmarks, and improved aesthetics and prompt alignment through optimized training and inference techniques.
