RLFR: Extending Reinforcement Learning for LLMs with Flow Environment

Jinghao Zhang, Naishan Zheng, Ruilin Li, Dongzhou Cheng, Zheming Liang, Feng Zhao, Jiaqi Wang

2025-10-14

Summary

This paper introduces a new method, RLFR, to improve how Large Language Models (LLMs) learn to reason by giving them better feedback during the learning process. It builds on a technique called Reinforcement Learning with Verifiable Rewards (RLVR), which aims to make LLMs more reliable thinkers.

What's the problem?

Current RLVR methods often struggle because they rely on simple 'correct' or 'incorrect' feedback, which can cause the model to discard potentially good intermediate reasoning just because the final answer was wrong. Getting detailed, high-quality feedback on *how* a model reasons is expensive and time-consuming, since it requires humans to carefully evaluate each step. Existing attempts to substitute automated signals for this feedback, such as token entropy or likelihood, have not fully exploited the information available inside the model itself.

What's the solution?

The researchers propose RLFR, which builds a 'flow field' over the LLM's internal workings – specifically, its 'latent space'. Think of this as a map of how the model's hidden states evolve during sound reasoning. They construct this map from both high-quality data from past successes (off-policy) and freshly sampled attempts that pass a quality filter (on-policy rejection sampling). During training, the model is then rewarded according to how much the 'velocity' of its own latent states deviates from this established flow, giving it step-by-step feedback that goes beyond a final correct/incorrect judgment.
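To make the idea concrete, here is a minimal, self-contained sketch of the general recipe: fit a velocity field on reference latents with a standard flow-matching objective, then score new latents by how far their implied velocities deviate from the field. Everything below is illustrative and simplified – `ref_latents`, the linear velocity model, and `flow_reward` are stand-ins invented for this sketch, not the paper's actual architecture or reward formula.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Hypothetical "reference latents": stand-ins for hidden states
# collected from high-quality reasoning traces.
ref_latents = rng.normal(loc=2.0, scale=0.5, size=(512, dim))

# A deliberately tiny linear velocity model v(x, t) = W @ [x, t] + b,
# trained with the standard flow-matching objective:
#   x_t = (1 - t) * x0 + t * x1,  target velocity u = x1 - x0.
W = np.zeros((dim, dim + 1))
b = np.zeros(dim)
lr = 0.05

for step in range(2000):
    idx = rng.integers(0, len(ref_latents), size=64)
    x1 = ref_latents[idx]                    # data endpoint
    x0 = rng.normal(size=x1.shape)           # noise endpoint
    t = rng.uniform(size=(len(idx), 1))
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    feats = np.concatenate([xt, t], axis=1)  # model input [x_t, t]
    pred = feats @ W.T + b
    err = pred - target
    # Plain least-squares gradient step on ||pred - target||^2.
    W -= lr * (err.T @ feats) / len(idx)
    b -= lr * err.mean(axis=0)

def flow_reward(z, n_samples=32):
    """Score a policy latent z by how well the learned flow 'explains'
    it: small velocity deviation means z lies in a region the reference
    flow covers. The sign/scaling used as the actual RL reward is a
    design choice in the paper and is not reproduced here."""
    x0 = rng.normal(size=(n_samples, dim))
    t = rng.uniform(size=(n_samples, 1))
    xt = (1 - t) * x0 + t * z
    target = z - x0
    feats = np.concatenate([xt, t], axis=1)
    pred = feats @ W.T + b
    deviation = np.mean(np.sum((pred - target) ** 2, axis=1))
    return -deviation

# A latent near the reference distribution scores better than one far away.
in_dist = flow_reward(ref_latents.mean(axis=0))
out_dist = flow_reward(np.full(dim, -6.0))
```

The key design point this sketch illustrates is that the reference data only needs to be *compressed once* into the flow model's parameters; afterwards, scoring a policy latent requires no access to the original expert traces.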

Why it matters?

This work is important because it offers a more efficient way to train LLMs to reason effectively. By leveraging the model's internal representation of information, RLFR reduces the need for expensive human feedback. It shows that LLMs have a lot of untapped potential in their internal 'thinking' and provides a new way to unlock it, leading to improvements in both language-based and multimodal reasoning tasks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, policies optimized with binary verification are prone to overlooking potentially valuable exploration in reasoning trajectories. In view of the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals for reward shaping of process tokens, such as entropy and likelihood collected from logit space. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from latent space, and propose RLFR, where flow fields of model latents are constructed from either off-policy high-quality data or on-policy rejection sampling data, and the velocity deviations of policy latents within them are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space remains much underexplored. Moreover, RLFR can compress any off-policy expert data into a reference for constituting reward signals, and we show that the efficient context dependence compressed within the hidden states is utilized, rather than individual token-level denotations, for context comprehension. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards and suggest a promising paradigm for reward shaping with auxiliary signals.