V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
Bingda Tang, Yuhui Zhang, Xiaohan Wang, Jiayuan Mao, Ludwig Schmidt, Serena Yeung-Levy
2026-04-29
Summary
This paper focuses on improving how we teach AI image generators to create pictures that people actually like, or that meet specific goals. It's about making these generators better at following instructions and producing high-quality results.
What's the problem?
Training AI image generators to do what we want is tricky. One common method, reinforcement learning, is principled in theory but can't be applied directly: these models can't easily compute how likely they were to produce a given image, and that likelihood is exactly what the method needs as a learning signal. Previous approaches either trained on every step of the generation process, which is stable but slow, or used a mathematical shortcut to approximate the likelihood, and that shortcut had so far produced worse images. Essentially, it has been hard to give the AI feedback on its artwork that is both clear and efficient.
What's the solution?
The researchers show that the shortcut method can in fact be made both reliable and fast. Their new technique, called V-GRPO, combines the likelihood shortcut with a learning algorithm called GRPO, reducing the noise in the estimated scores and keeping each learning step carefully controlled. It's like giving the AI more precise guidance without making it take forever to learn; a sketch of the core scoring idea follows below.
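To make "learning from estimated quality scores" concrete, here is a minimal sketch (our own illustration, not the authors' code) of the group-relative scoring at the heart of GRPO: each prompt gets a group of generated images, and every image's reward is normalized against its own group's statistics, so no separate learned value baseline is needed.

```python
# Minimal sketch of GRPO-style group-relative advantages (not the paper's code).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) raw scores from a reward model.

    Returns advantages of the same shape: each reward minus its group's mean,
    divided by the group's standard deviation. This per-group normalization is
    the core idea that lets GRPO skip a learned value network.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 generated images each.
rewards = torch.tensor([[0.1, 0.5, 0.3, 0.9],
                        [0.2, 0.2, 0.8, 0.4]])
print(group_relative_advantages(rewards))
```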
Why does it matter?
This work matters because it lets us train AI image generators much more quickly and effectively. V-GRPO produces better images than previous methods and trains substantially faster (the paper reports roughly 2× faster than MixGRPO and 3× faster than DiffusionNFT), opening the door to more advanced and personalized AI image tools. It also simplifies the training process, making it easier for others to build on this research.
Abstract
Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a 2× speedup over MixGRPO and a 3× speedup over DiffusionNFT.
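To illustrate the ELBO-based surrogate the abstract describes, here is a minimal sketch under our own assumptions (the paper's exact ELBO weighting, variance-reduction techniques, and gradient-step control are not reproduced): the intractable log-likelihood of a generated image is replaced by a Monte Carlo estimate of the diffusion ELBO, i.e. the standard denoising objective, and its gradient is weighted by a group-relative advantage. The names `model`, `alphas_cumprod`, and `n_mc` are hypothetical.

```python
# Hedged sketch of a policy-gradient loss with an ELBO likelihood surrogate.
# Assumes `model(x_t, t)` is a noise-prediction network and `alphas_cumprod`
# is a 1-D tensor of cumulative noise-schedule products on x0's device.
import torch

def elbo_surrogate_loss(model, x0, advantages, alphas_cumprod, n_mc: int = 4):
    """x0: (B, C, H, W) generated images; advantages: (B,) group-relative scores.

    Averaging the per-sample ELBO estimate over several (t, noise) draws
    (`n_mc`) is one generic way to reduce surrogate variance, in the spirit
    of the paper's variance-reduction theme; the actual techniques may differ.
    """
    B = x0.shape[0]
    loss = x0.new_zeros(())
    for _ in range(n_mc):
        t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
        a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
        noise = torch.randn_like(x0)
        # Forward diffusion: noise the generated sample to timestep t.
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
        eps_pred = model(x_t, t)
        # Negative-ELBO term (up to schedule weighting): per-sample denoising error.
        nll = ((eps_pred - noise) ** 2).mean(dim=(1, 2, 3))
        # Policy gradient: lower the surrogate NLL of high-advantage samples.
        loss = loss + (advantages.detach() * nll).mean()
    return loss / n_mc
```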