FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu, Dinghao Yang, Yangyang Tang, Junjie Bai, Ping Luo, Song Han, Enze Xie

2026-04-09

Summary

This paper focuses on making it faster and cheaper to improve AI image generators, specifically models that create images from text prompts, so their outputs better match what people actually want. The authors do this with reinforcement learning, where the model learns by receiving 'rewards' based on how good its generated images are.

What's the problem?

Training these AI image generators to produce images people like requires a lot of trial and error: generating many candidate images and evaluating them. This process, called a 'rollout,' is very computationally expensive, especially for large, powerful models. Using a low-precision number format (FP4) can speed rollouts up, but doing so naively often degrades image quality and makes the training less effective.

What's the solution?

The researchers developed a framework called Sol-RL (Speed-of-light RL), which combines the speed of FP4 with the accuracy of a more precise number format, BF16. It works in two stages: first, it quickly generates a large pool of candidate images using FP4 rollouts. Then, rather than simply keeping the best images, it selects a highly contrastive subset (the strongest and weakest candidates), regenerates those samples in BF16, and updates the model only on these high-fidelity images. This way, exploration gets FP4's throughput while policy optimization keeps BF16's accuracy, so speed is gained without sacrificing training quality.
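The two-stage loop can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: all function names (`rollout_fp4`, `reward`, `select_contrastive`, `regenerate_bf16`, `policy_update`) are hypothetical stand-ins, and random numbers play the role of generated images.

```python
import random

def rollout_fp4(prompt, group_size):
    # Stage 1: cheap, high-throughput FP4 exploration.
    # Stand-in: each "image" is just a random scalar here.
    return [random.random() for _ in range(group_size)]

def reward(sample):
    # Stand-in for a learned preference / reward model.
    return sample

def select_contrastive(pool, k):
    # Keep the k highest- and k lowest-reward candidates so the
    # policy sees a strongly contrastive learning signal.
    ranked = sorted(pool, key=reward)
    return ranked[:k] + ranked[-k:]

def regenerate_bf16(prompt, selected):
    # Stage 2: re-run only the selected candidates at full BF16
    # precision (stand-in: identity).
    return list(selected)

def policy_update(samples):
    # Stand-in for the RL step on high-fidelity samples:
    # compute group-relative advantages against the mean reward.
    rewards = [reward(s) for s in samples]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

pool = rollout_fp4("a red bicycle", group_size=64)   # fast FP4 rollouts
subset = select_contrastive(pool, k=4)               # 8 of 64 survive
samples = regenerate_bf16("a red bicycle", subset)   # high-fidelity copies
advantages = policy_update(samples)
```

The point of the design is visible even in this toy version: the expensive precise pass (`regenerate_bf16` plus the gradient step) touches only a small selected subset, while the bulk of the generation cost runs at FP4 throughput.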

Why it matters?

This research matters because it makes it practical to train much better AI image generators without massive amounts of computing power. By accelerating training convergence by up to 4.64×, it unlocks massive rollout scaling, and with it AI that generates images more closely aligned with human preferences, at a significantly reduced cost.

Abstract

Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of the BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to 4.64×, unlocking the power of massive rollout scaling at a fraction of the cost.