VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia
2026-01-06
Summary
This paper focuses on improving how images are created by a specific type of AI model called Visual AutoRegressive (VAR) models, which build images step by step, refining them from coarse to fine scales.
What's the problem?
VAR models are tricky to fine-tune with reinforcement learning, where the model learns through trial and error. Because each generation step works on a differently structured input, the updates made at different steps can conflict with one another, leading to unstable training and images that don't match the intended objective as well as they should. The model is improving different parts of the image at different steps, and it is hard to tell which of those changes are actually helpful.
What's the solution?
The researchers developed a new training method that builds on an existing technique called Group Relative Policy Optimization (GRPO). They add three key improvements: first, an intermediate reward early in generation, which keeps training stable and points the model in the right direction. Second, a dynamic reweighting of the generation steps, so that each step gets credit in proportion to its contribution to the final result. Finally, a mask propagation trick, inspired by Reward Feedback Learning (ReFL), which keeps improvements to one part of the image from accidentally disrupting other parts, both across space and across generation steps.
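To make the three ingredients concrete, here is a minimal sketch of how an intermediate reward, per-step weights, and a token mask could enter a GRPO-style clipped policy loss. The function names, tensor shapes, the additive reward combination, and the weighting and masking semantics are assumptions for illustration only, not the paper's exact formulation.

```python
# Illustrative sketch, not the paper's algorithm. All names and shapes are assumed.
import torch

def combined_reward(final_reward, intermediate_reward, alpha=0.1):
    # Assumed: the early-stage intermediate reward is folded into the scalar reward
    # with a small weight alpha.
    return final_reward + alpha * intermediate_reward

def var_grpo_step_loss(log_probs_new, log_probs_old, advantages,
                       step_weights, token_mask, clip_eps=0.2):
    """Clipped GRPO-style surrogate with per-step reweighting and token masking (sketch).

    Assumed shapes: (T, N) tensors over T generation steps and N tokens per step,
    except step_weights, which is (T,).
    """
    ratio = torch.exp(log_probs_new - log_probs_old)              # importance ratio vs. old policy
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # Zero out tokens whose update would conflict with other regions or steps
    # (a stand-in for mask propagation), then weight each generation step by its
    # assumed credit-assignment weight.
    w = token_mask * step_weights.unsqueeze(-1)
    return -(surrogate * w).sum() / w.sum().clamp_min(1.0)
```

The mask and step weights here only rescale how much each token and step contributes to the loss; the paper's actual mask propagation rule is more involved and is not reproduced here.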
Why it matters?
This work is important because it makes VAR models much more reliable and effective to fine-tune. By solving the training instability problem, the researchers enable these models to create higher-quality images that align more closely with the intended objectives, opening the door to more advanced image generation applications.
Abstract
Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
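For reference, the GRPO baseline that the abstract builds on scores a group of samples generated from the same prompt and normalizes each sample's reward against the group's mean and standard deviation. A minimal sketch of that group-relative advantage is below; the reward_model call and variable names are illustrative, not from the paper.

```python
# Standard GRPO-style group-relative advantage (the baseline being extended).
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: (G,) scalar rewards for G samples generated from the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Usage sketch (names are illustrative):
# rewards = reward_model(images, prompt)          # (G,) scores from a learned reward model
# advantages = group_relative_advantages(rewards) # broadcast over steps/tokens in the update
```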