FlowRL: Matching Reward Distributions for LLM Reasoning
Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei
2025-09-19
Summary
This paper introduces FlowRL, a reinforcement learning method for training large language models that encourages a wider range of reasoning paths rather than only the ones that immediately lead to high rewards.
What's the problem?
When language models are trained with reinforcement learning to solve complex problems like math or coding, standard reward-maximizing methods tend to fixate on the most obvious high-reward solutions. In doing so, the model can miss other valid but less frequent ways of arriving at the correct answer. This narrows the diversity of its reasoning and limits its ability to generalize to new problems.
What's the solution?
FlowRL tackles this by changing how rewards are used. Instead of simply chasing the highest possible reward, it tries to match the overall *distribution* implied by the rewards: scalar rewards are converted into a normalized target distribution, and the model is then trained to match that distribution, which encourages exploration of different reasoning paths. In effect, it balances the 'flow' of probability across possibilities, preventing the model from fixating on just a few high-reward options.
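The core idea can be illustrated with a short sketch. The code below is not the authors' implementation; the function names, the temperature `beta`, and the toy numbers are illustrative assumptions. It treats a small group of sampled reasoning paths as a categorical distribution, converts their scalar rewards into a target distribution, and measures how far the policy is from that target with a reverse KL divergence:

```python
# Illustrative sketch only: turn scalar rewards into a target distribution
# over a group of sampled reasoning paths, then measure how far the policy
# is from that target with a reverse KL divergence.
import torch

def target_distribution(rewards: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Target distribution p(y|x) proportional to exp(beta * r(x, y)) over the group."""
    return torch.softmax(beta * rewards, dim=-1)

def reverse_kl(policy_logprobs: torch.Tensor, rewards: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """KL(pi || p_target) for the sampled group. Minimizing this keeps probability
    mass on all adequately rewarded paths instead of collapsing onto the best one."""
    target_logprobs = torch.log_softmax(beta * rewards, dim=-1)
    return (policy_logprobs.exp() * (policy_logprobs - target_logprobs)).sum()

# Four sampled reasoning paths: two valid (high reward), two invalid.
rewards = torch.tensor([1.0, 0.9, 0.1, 0.0])
# A policy that has collapsed almost entirely onto the first path.
policy_logprobs = torch.log_softmax(torch.tensor([3.0, 0.5, 0.2, 0.1]), dim=-1)

print(target_distribution(rewards))          # mass spread across both valid paths
print(reverse_kl(policy_logprobs, rewards))  # positive penalty for the collapse
```

Driving this divergence toward zero spreads the policy's probability across all well-rewarded paths, which is the diversity-preserving behavior the paper aims for.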
Why it matters?
This research matters because it demonstrates a better way to train language models for versatile, complex reasoning. By promoting diversity in the reasoning process, FlowRL delivers significant gains on challenging math and coding tasks, and it suggests that covering the full range of valid solutions, not just the most rewarding ones, is key to building more capable and adaptable AI systems.
Abstract
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
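As a rough sketch of how flow balancing with a learnable partition function might look in code: this is an assumption-laden simplification in the spirit of GFlowNet-style trajectory balance, not the paper's actual objective or implementation, and `log_z`, `beta`, and the toy values below are illustrative only.

```python
# Sketch of a flow-balance style objective with a learnable log-partition term.
# Driving the squared residual
#     (log Z(x) + log pi(y|x) - beta * r(x, y))^2
# to zero makes pi(y|x) proportional to exp(beta * r(x, y)), i.e. the policy
# matches the reward-defined target distribution rather than maximizing reward.
import torch
import torch.nn as nn

class FlowBalanceLoss(nn.Module):
    def __init__(self, beta: float = 1.0):
        super().__init__()
        self.beta = beta
        # Learnable estimate of log Z(x); in practice this could be a small
        # network conditioned on the prompt.
        self.log_z = nn.Parameter(torch.zeros(1))

    def forward(self, policy_logprob: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
        # Residual of the balance condition for each sampled trajectory.
        residual = self.log_z + policy_logprob - self.beta * reward
        return (residual ** 2).mean()

# Usage: policy_logprob is the summed token log-probability of a sampled
# reasoning trajectory under the current policy; reward is its scalar reward.
loss_fn = FlowBalanceLoss(beta=2.0)
policy_logprob = torch.tensor([-12.3, -15.1], requires_grad=True)  # two trajectories
reward = torch.tensor([1.0, 0.0])
loss = loss_fn(policy_logprob, reward)
loss.backward()  # gradients flow to log_z and, in a real setup, to the policy
```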