Reinforcing Diffusion Models by Direct Group Preference Optimization

Yihong Luo, Tianyang Hu, Jing Tang

2025-10-10

Summary

This paper introduces a new method, called Direct Group Preference Optimization (DGPO), for improving diffusion models using reinforcement learning, specifically focusing on making the process much faster and more effective.

What's the problem?

Reinforcement learning techniques have been successful with large language models, but applying them to diffusion models is tricky. A popular method, Group Relative Preference Optimization (GRPO), requires a stochastic policy, meaning there must be randomness in how samples are generated. However, the fastest ways to generate images with diffusion models are deterministic ODE solvers, which have no randomness at all. Previous attempts worked around this by switching to slower SDE-based samplers to inject noise, but that noise is model-agnostic Gaussian noise, which made training inefficient and convergence slow.
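The sampler conflict can be illustrated with a toy sketch (not the paper's actual sampler): a single DDIM-style denoising step, where the deterministic ODE update always maps the same input to the same output, while the SDE variant injects fresh Gaussian noise at every step. Here `pred_noise` stands in for the network's noise prediction, and the `alpha` schedule values are made-up constants.

```python
import math
import random

def ode_step(x, pred_noise, alpha_t, alpha_prev):
    """Deterministic (probability-flow ODE) update: same input -> same output."""
    x0_hat = (x - math.sqrt(1 - alpha_t) * pred_noise) / math.sqrt(alpha_t)
    return math.sqrt(alpha_prev) * x0_hat + math.sqrt(1 - alpha_prev) * pred_noise

def sde_step(x, pred_noise, alpha_t, alpha_prev, sigma, rng):
    """Stochastic update: adds model-agnostic Gaussian noise, as GRPO requires."""
    x0_hat = (x - math.sqrt(1 - alpha_t) * pred_noise) / math.sqrt(alpha_t)
    mean = (math.sqrt(alpha_prev) * x0_hat
            + math.sqrt(max(1 - alpha_prev - sigma**2, 0.0)) * pred_noise)
    return mean + sigma * rng.gauss(0.0, 1.0)

x, eps, a_t, a_prev = 0.7, 0.1, 0.5, 0.8
# The ODE step is reproducible; the SDE step changes with the random seed.
print(ode_step(x, eps, a_t, a_prev) == ode_step(x, eps, a_t, a_prev))  # True
print(sde_step(x, eps, a_t, a_prev, 0.1, random.Random(0)))
```

GRPO needs the stochastic version to define sample probabilities under its policy-gradient objective, which is exactly why the fast deterministic sampler was previously off the table.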

What's the solution?

The researchers developed DGPO, a new online RL algorithm that drops the traditional policy-gradient framework entirely. Instead of needing a stochastic policy, it learns directly from group-level preferences: for each prompt, it generates a group of samples, scores them with a reward, and learns from how the samples rank relative to one another within the group. Because no sampling randomness is required, this design unlocks fast, deterministic ODE samplers while still benefiting from reinforcement learning, leading to much quicker training.
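The group-level idea can be sketched in a few lines. This is a hypothetical toy, not the paper's exact DGPO objective: samples generated for one prompt are scored by a reward model, and each sample is weighted by its reward relative to its peers in the group, so only the within-group ordering matters.

```python
import math

def group_relative_weights(rewards):
    """Softmax over group rewards: samples that beat their peers get more weight.
    Shifting by the max is for numerical stability and leaves weights unchanged."""
    m = max(rewards)
    exps = [math.exp(r - m) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical reward-model scores for 4 samples of the same prompt.
rewards = [0.2, 0.9, 0.5, 0.9]
weights = group_relative_weights(rewards)
print(weights)  # the two highest-reward samples share the largest weight
```

Since the weights depend only on relative rewards inside the group, no per-sample likelihood under a stochastic policy is needed, which is what lets the method use deterministic samplers.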

Why it matters?

DGPO is a significant step forward because it dramatically speeds up the reinforcement-learning training of diffusion models, roughly 20 times faster than existing state-of-the-art methods, and achieves better scores on both in-domain and out-of-domain reward metrics. This means higher-quality images can be produced far more efficiently.

Abstract

While reinforcement learning methods such as Group Relative Preference Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.