ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang

2025-06-24

Summary

This paper introduces ReasonFlux-PRM, a new type of Process Reward Model that helps large language models improve their step-by-step thinking by evaluating the entire reasoning path, not just isolated pieces of it.

What's the problem?

Current reward models typically judge only individual steps or the final answer, so they can miss how well the model's reasoning holds together across many steps of a long problem.

What's the solution?

The researchers designed a trajectory-aware Process Reward Model that supervises reasoning at both the step level and across the entire sequence of steps, improving learning in settings such as model distillation, reinforcement learning, and test-time scaling.
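To make the idea of combining step-level and trajectory-level supervision concrete, here is a minimal sketch in Python. The scorers and the blending weight `alpha` are purely illustrative placeholders, not the paper's actual reward functions.

```python
# Hypothetical sketch: blend a per-step score with a whole-trajectory score.
# The scoring heuristics below are illustrative stand-ins, not ReasonFlux-PRM's
# learned reward models.

def step_rewards(steps):
    # Placeholder step-level scorer: favors substantive, non-trivial steps.
    return [min(len(s) / 50.0, 1.0) for s in steps]

def trajectory_reward(steps):
    # Placeholder trajectory-level scorer: favors trajectories of a
    # reasonable overall length.
    return min(len(steps) / 5.0, 1.0)

def combined_reward(steps, alpha=0.5):
    """Linearly blend the average step-level score with the trajectory score."""
    step_score = sum(step_rewards(steps)) / len(steps)
    return alpha * step_score + (1 - alpha) * trajectory_reward(steps)

trace = ["Define variables.", "Set up the equation carefully.", "Solve for x."]
score = combined_reward(trace)
```

The point of the sketch is only the structure: a trajectory-aware PRM produces a signal that depends on each step and on the trajectory as a whole, which is what lets it supervise long chains of thought during distillation, RL, and test-time scaling.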

Why does it matter?

This matters because it helps AI systems reason more logically and accurately over long, complicated problems, making them better at tasks that demand careful multi-step reasoning, such as math or decision-making.

Abstract

ReasonFlux-PRM, a novel trajectory-aware Process Reward Model, evaluates reasoning traces with step-level and trajectory-level supervision, enhancing performance in model distillation, reinforcement learning, and test-time scaling.