
E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models

Shengjun Zhang, Zhang Zhang, Chensheng Dai, Yueqi Duan

2026-01-08


Summary

This paper focuses on improving how well AI models learn what humans want, specifically for 'flow matching' models trained with 'reinforcement learning'. It tackles a challenge where the AI receives confusing feedback while it is learning.

What's the problem?

When training AI using human preferences, the AI explores different options by making random changes. However, if these changes are spread across many steps, the AI receives unclear or weak signals about whether it's getting closer to what humans like. Steps with very predictable changes don't help the AI explore effectively, while spreading randomness over many steps makes it hard to pinpoint which change actually led to a good or bad result. Essentially, the AI struggles to learn from noisy feedback when making many small adjustments.

What's the solution?

The researchers developed a new method called E-GRPO that makes the random changes more strategic. It combines several consecutive small, predictable steps into one larger, more random step, which helps the AI explore more effectively. They also developed a better way to evaluate each output: its reward is compared against other outputs that were generated with the same combined random step. This helps the AI understand which changes are actually leading to better results.
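The step-merging idea can be sketched as a simple scheduling rule. The function below is an illustrative toy, not the paper's implementation: it assumes we already have a per-step entropy estimate, and it uses a hypothetical threshold to decide which runs of consecutive low-entropy steps get consolidated into one stochastic (SDE) step, leaving the remaining steps as deterministic (ODE) steps.

```python
def merge_low_entropy_steps(entropies, threshold):
    """Partition a denoising schedule into segments.

    Consecutive steps whose entropy falls below `threshold` are merged
    into one consolidated SDE step; every other step is kept as an
    individual ODE step. Returns a list of (kind, step_indices) pairs.
    Both the entropy estimates and the threshold are assumptions here.
    """
    segments = []
    run = []  # indices of the current run of low-entropy steps
    for i, h in enumerate(entropies):
        if h < threshold:
            run.append(i)  # extend the low-entropy run
        else:
            if run:
                segments.append(("sde_merged", run))
                run = []
            segments.append(("ode", [i]))  # high-entropy step stays ODE
    if run:  # flush a trailing low-entropy run
        segments.append(("sde_merged", run))
    return segments


# Steps 0-1 and 3 are low-entropy and get consolidated; step 2 stays ODE.
schedule = merge_low_entropy_steps([0.1, 0.2, 0.9, 0.1], threshold=0.5)
# → [('sde_merged', [0, 1]), ('ode', [2]), ('sde_merged', [3])]
```

The design choice is that randomness is concentrated in a few consolidated steps rather than diluted across the whole trajectory, which the paper argues makes reward signals easier to attribute.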

Why it matters?

This research is important because it makes AI systems that learn from human preferences more reliable and efficient. By improving how the AI explores and understands feedback, we can create AI that better aligns with human values and goals, leading to more useful and trustworthy AI applications.

Abstract

Recent reinforcement learning has enhanced flow matching models on human preference alignment. While stochastic sampling enables the exploration of denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high entropy steps enable more efficient and effective exploration, while low entropy steps result in undistinguished roll-outs. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization that increases the entropy of SDE sampling steps. Since the integration of stochastic differential equations suffers from ambiguous reward signals due to stochasticity spread over multiple steps, we merge consecutive low entropy steps to form one high entropy step for SDE sampling, while applying ODE sampling on the other steps. Building upon this, we introduce a multi-step group normalized advantage, which computes group-relative advantages within samples sharing the same consolidated SDE denoising step. Experimental results across different reward settings demonstrate the effectiveness of our method.
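The group-normalized advantage in the abstract follows the standard GRPO recipe: each sample's reward is standardized against the other samples in its group, where here a group is the set of roll-outs that shared the same consolidated SDE step. A minimal sketch, assuming scalar rewards and string group labels (the grouping keys and epsilon are illustrative, not from the paper):

```python
import numpy as np

def group_normalized_advantages(rewards, group_ids, eps=1e-8):
    """GRPO-style advantage: standardize each reward against the mean
    and std of the samples in its group. Here a group is assumed to be
    the set of roll-outs sharing one consolidated SDE denoising step."""
    rewards = np.asarray(rewards, dtype=float)
    advantages = np.empty_like(rewards)
    for g in set(group_ids):
        mask = np.array([gid == g for gid in group_ids])
        group = rewards[mask]
        # Zero-mean, unit-std within the group; eps guards degenerate groups.
        advantages[mask] = (group - group.mean()) / (group.std() + eps)
    return advantages


# Samples 0-2 shared consolidated SDE step "A"; samples 3-4 shared step "B".
adv = group_normalized_advantages(
    [1.0, 2.0, 3.0, 10.0, 20.0],
    ["A", "A", "A", "B", "B"],
)
```

Normalizing within a group that shares the same stochastic step means the advantage reflects only the differences the shared random step produced, which is the credit-assignment point the abstract makes.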