TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Zheng Ding, Weirui Ye

2025-12-10

Summary

This paper introduces TreeGRPO, a method that makes fine-tuning AI image generators with human feedback much faster and more efficient.

What's the problem?

Currently, a technique called reinforcement learning is used to fine-tune AI image generators to better match what people want, but it takes a huge amount of computing power and time, making it difficult for many researchers and developers to use. Essentially, getting these AI models to consistently create images people like is really expensive.

What's the solution?

TreeGRPO works by cleverly organizing the process of trying out different image variations. Instead of starting from scratch each time, it builds a 'tree' of possibilities, reusing parts of previous attempts. Imagine you're sketching – you don't redraw the whole thing if you just want to change the eyes; you build on what you already have. TreeGRPO does something similar, making it learn faster and more efficiently. It also figures out *exactly* which parts of the image generation process are contributing to good or bad results, allowing for more precise improvements.
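The prefix-reuse idea can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: each tree edge stands in for one denoising forward pass, and children extend their parent's partial trajectory, so the work done on a shared prefix is counted once rather than once per final image.

```python
def denoise_step(state, branch_id):
    # Stand-in for one diffusion denoising step (the real model would run
    # a learned noise-prediction network here; this toy just records the path).
    return state + [branch_id]

def build_tree(state, depth, branch_factor):
    """Grow candidate trajectories as a tree: every child extends its
    parent's partial trajectory, so common prefixes are computed once."""
    if depth == 0:
        return {"state": state, "children": []}
    return {"state": state,
            "children": [build_tree(denoise_step(state, b), depth - 1, branch_factor)
                         for b in range(branch_factor)]}

def count_edges(node):
    # Each edge corresponds to one denoising forward pass actually executed.
    return len(node["children"]) + sum(count_edges(c) for c in node["children"])

def count_leaves(node):
    # Each leaf is one complete candidate image trajectory.
    return 1 if not node["children"] else sum(count_leaves(c) for c in node["children"])

tree = build_tree([], depth=3, branch_factor=2)
print(count_leaves(tree))      # 8 complete trajectories
print(count_edges(tree))       # 14 denoising steps with prefix sharing
print(count_leaves(tree) * 3)  # 24 steps if each trajectory were sampled independently
```

With depth 3 and branching factor 2, the tree yields 8 finished images for 14 denoising steps instead of the 24 that independent sampling would need, and the saving grows with depth.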

Why it matters?

This research is important because it makes aligning AI image generators with human preferences much more practical. By significantly reducing the computational cost, more people can work on improving these models, leading to better and more personalized AI-generated images. It’s a step towards making AI art tools more accessible and responsive to what users actually want.

Abstract

Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) high sample efficiency, achieving better performance under the same number of training samples; (2) fine-grained credit assignment via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods; and (3) amortized computation, where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves 2.4× faster training while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at treegrpo.github.io.
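The "reward backpropagation" point can be made concrete with a small sketch. The scheme below is an assumption-laden illustration, not the paper's exact estimator: leaf rewards are averaged upward through the tree, and at each branch point a child's advantage is its value minus the sibling mean, giving a different signal at every step rather than one uniform trajectory-level advantage.

```python
def backprop_values(node, reward_fn):
    """Propagate leaf rewards up the tree: each node's value is the mean
    reward over the leaves reachable from it (illustrative scheme;
    names and reward values here are hypothetical)."""
    if not node["children"]:
        node["value"] = reward_fn(node["state"])
        return node["value"]
    child_vals = [backprop_values(c, reward_fn) for c in node["children"]]
    node["value"] = sum(child_vals) / len(child_vals)
    return node["value"]

def step_advantages(node, advs=None):
    """At each branch point, a child's advantage is its value minus the
    sibling mean -- a step-specific signal, unlike trajectory-level GRPO,
    which assigns one shared advantage to every denoising step."""
    if advs is None:
        advs = []
    if node["children"]:
        mean = node["value"]  # by construction, the mean of the child values
        for c in node["children"]:
            advs.append(c["value"] - mean)
            step_advantages(c, advs)
    return advs

# A tiny two-level tree: root branches twice, each child has two leaf images.
tree = {"children": [
    {"children": [{"children": [], "state": "a"}, {"children": [], "state": "b"}]},
    {"children": [{"children": [], "state": "c"}, {"children": [], "state": "d"}]},
]}
rewards = {"a": 1.0, "b": 0.0, "c": 0.5, "d": 0.5}
backprop_values(tree, lambda s: rewards[s])
print(step_advantages(tree))  # [0.0, 0.5, -0.5, 0.0, 0.0, 0.0]
```

In this toy example the two subtrees have equal mean reward, so the first branching step gets zero advantage, while the step that separated leaf "a" from leaf "b" receives a large positive/negative pair: credit lands on the step where outcomes actually diverged.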