PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning

Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang

2025-09-02

Summary

This paper introduces a new reinforcement learning technique called PVPO, which aims to improve how AI agents learn to make decisions in complex environments.

What's the problem?

Existing 'critic-free' reinforcement learning methods, which are efficient at tackling difficult tasks, estimate which actions are best by sampling many attempts and comparing them against each other. This comparison process is computationally expensive, and because every attempt is judged only relative to its group, the agent can get stuck making marginally better choices instead of finding a truly better strategy. Essentially, it's like trying to improve by only looking at very similar options, which limits progress and takes a lot of computing power.

What's the solution?

PVPO solves this by using a 'reference model' to predict what a good outcome should look like *before* the agent actually tries anything. This prediction acts as a benchmark. By comparing the agent’s performance to this benchmark, instead of just comparing different attempts to each other, PVPO avoids getting stuck in local optima and reduces the number of trials needed. Additionally, the reference model helps identify which situations are most valuable to learn from, focusing training on the most impactful experiences.
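The contrast between group-relative advantages and an anchored advantage can be sketched in a few lines. This is an illustrative simplification, not the paper's exact formulas: the group-relative baseline is modeled as mean/std normalization within a group, and the PVPO-style anchor as a fixed reward score pre-computed from reference-model rollouts.

```python
import statistics

def group_relative_advantages(rewards):
    """Critic-free baseline (sketch): each rollout's reward is judged
    only against the other rollouts in its group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

def anchored_advantages(rewards, anchor_reward):
    """PVPO-style advantage (sketch): rewards are measured against a
    fixed anchor pre-computed from reference-model rollouts, so the
    signal does not vanish when all rewards in a group are similar."""
    return [r - anchor_reward for r in rewards]

# When every rollout in a group earns the same reward, the
# group-relative signal collapses to zero, while the anchored
# signal still tells the policy it is beating the reference.
uniform_rewards = [0.6, 0.6, 0.6, 0.6]
print(group_relative_advantages(uniform_rewards))   # all zeros
print(anchored_advantages(uniform_rewards, 0.4))    # all positive
```

This also hints at why fewer rollouts suffice: the anchor supplies a baseline that group statistics would otherwise have to estimate from many samples.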

Why it matters?

This research is important because it makes reinforcement learning more efficient and effective. PVPO achieves top-level performance on a variety of challenging tasks and works well with both small and large AI models, meaning it has the potential to be widely applied to real-world problems like robotics, game playing, and resource management.

Abstract

Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into a local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to perform rollouts in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves State-Of-The-Art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
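The data pre-sampling step described in the abstract can be pictured as a difficulty filter. The sketch below is an assumption about how such a filter might look: `ref_success_rate` (the fraction of reference-model rollouts on a prompt that earn a reward) and the 0.1/0.9 thresholds are illustrative names and values, not taken from the paper.

```python
def select_high_gain(prompts, ref_success_rate, low=0.1, high=0.9):
    """Data pre-sampling sketch: prompts the reference model already
    solves almost always (rate >= high) or almost never (rate <= low)
    carry little training signal, so they are dropped in favor of
    mid-difficulty, high-gain prompts. Thresholds are illustrative."""
    return [p for p in prompts if low < ref_success_rate(p) < high]

# Hypothetical per-prompt success rates from reference-model rollouts.
rates = {"trivial": 1.0, "hopeless": 0.0, "useful": 0.5}
kept = select_high_gain(list(rates), rates.get)
print(kept)  # ['useful']
```

The same reference rollouts thus serve double duty: their reward scores anchor the advantage estimate, and their success rates rank prompts by expected training gain.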