Near-Future Policy Optimization
Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang
2026-04-23
Summary
This paper focuses on improving a technique called Reinforcement Learning with Verifiable Rewards, which helps make AI systems more reliable after they've been initially trained. It's about finding the best way to give the AI extra learning examples to boost its performance.
What's the problem?
When trying to improve an AI using extra examples, you face a trade-off. You can take examples from a much stronger AI (a 'teacher'), but those examples may be too different from what the current AI is doing, making them hard to learn from. Or you can reuse examples from the AI's own past, but those are limited in quality because the AI has already moved beyond them. The key is finding examples that are both high-quality *and* similar enough to be useful, maximizing the balance between how much the AI can learn and how easily it can absorb the new information.
What's the solution?
The researchers came up with a method called Near-Future Policy Optimization (NPO). The idea is to use examples from a *later* version of the *same* AI during its training process. This later version is naturally better than the current one, providing high-quality examples, but it's still close enough in its learning process to be easily understood. They also created an automated version, AutoNPO, that decides when to use these examples and which later version to learn from, based on how well the AI is currently doing.
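As a rough illustration of the idea above, here is a minimal, hypothetical sketch of mixed-policy sampling: each training batch combines on-policy rollouts from the current model with auxiliary rollouts from a "near-future" checkpoint of the same run. All names (`sample_trajectory`, `npo_batch`, the quality numbers) are illustrative assumptions, not the paper's actual implementation.

```python
import random

def sample_trajectory(policy_quality, rng):
    """Toy stand-in for a rollout: (reward, source_quality)."""
    return rng.random() * policy_quality, policy_quality

def npo_batch(current_quality, future_quality, batch_size, mix_ratio, seed=0):
    """Mix on-policy rollouts with rollouts from a near-future checkpoint."""
    rng = random.Random(seed)
    n_future = int(batch_size * mix_ratio)
    # Auxiliary trajectories from the stronger, near-future checkpoint.
    batch = [sample_trajectory(future_quality, rng) for _ in range(n_future)]
    # The rest of the batch stays on-policy.
    batch += [sample_trajectory(current_quality, rng)
              for _ in range(batch_size - n_future)]
    return batch

batch = npo_batch(current_quality=0.6, future_quality=0.8,
                  batch_size=8, mix_ratio=0.25)
```

With `mix_ratio=0.25`, two of the eight trajectories in the batch come from the stronger near-future checkpoint; the ratio controls how much off-policy signal is injected per step.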
Why does it matter?
This research is important because it provides a practical way to significantly improve the performance of AI systems, especially large language models. By automatically finding and using helpful learning examples, it makes these systems more capable and helps them reach their full potential, while also speeding up the training process.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the strong-enough (higher Q, more new knowledge to learn) and close-enough (lower V, more readily absorbed) conditions required to maximize the effective learning signal S = Q/V. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes S. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
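The selection rule in the abstract, pick the guide that maximizes S = Q/V, can be sketched in a few lines. This is a hedged toy, not the paper's method: here Q is assumed to be the quality gain of a candidate over the current policy and V a distance/variance cost to it, and the candidate numbers are invented purely to contrast an external teacher, a past checkpoint, and a near-future checkpoint.

```python
def select_guide(candidates, current_quality):
    """candidates: list of (name, quality, distance_to_current).
    Returns the candidate maximizing S = Q / V."""
    best, best_s = None, float("-inf")
    for name, quality, distance in candidates:
        q_gain = quality - current_quality   # "strong enough": more to learn
        v_cost = max(distance, 1e-8)         # "close enough": readily absorbed
        s = q_gain / v_cost
        if s > best_s:
            best, best_s = name, s
    return best, best_s

# Illustrative numbers only: teacher is strong but far; the past
# checkpoint is close but weaker; the near-future checkpoint is both.
candidates = [
    ("external_teacher", 0.90, 0.50),
    ("past_checkpoint",  0.55, 0.05),
    ("near_future_ckpt", 0.70, 0.08),
]
guide, score = select_guide(candidates, current_quality=0.60)
```

Under these toy numbers the external teacher scores S = 0.30/0.50 = 0.6, the past checkpoint scores negative (it is weaker than the current policy), and the near-future checkpoint wins with S = 0.10/0.08 = 1.25, mirroring the abstract's argument for learning from one's own near-future self.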