VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang
2025-04-08
Summary
This paper introduces VAPO, a training method that helps AI models solve complex math problems by learning to reason through steps more efficiently and reliably.
What's the problem?
When AI models tackle hard problems that require long chains of reasoning, training often breaks down due to biased value estimates, wildly varying response lengths, and sparse feedback about which steps are correct.
What's the solution?
VAPO combines a reward scheme that adapts to different response lengths, a pre-trained value model that reduces learning bias, and denser feedback signals, keeping training stable and fast.
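One of the ideas above is making advantage estimation adapt to response length. The sketch below illustrates this with Generalized Advantage Estimation (GAE) whose decay factor lambda grows toward 1 for longer sequences, so long reasoning traces accumulate less bootstrapping bias. This is a minimal illustration, not the paper's implementation; the `alpha` constant and the exact `1 - 1/(alpha * length)` schedule are assumptions for demonstration.

```python
def adaptive_lambda(seq_len, alpha=0.05):
    # Hypothetical schedule: longer sequences get lambda closer to 1,
    # reducing accumulated bias from bootstrapped value estimates.
    return 1.0 - 1.0 / (alpha * seq_len)

def gae_advantages(rewards, values, gamma=1.0, lam=None):
    """Generalized Advantage Estimation over one token sequence.

    rewards[t] and values[t] are per-token reward and value estimates;
    with a sparse verifier reward, only the final reward is nonzero.
    """
    if lam is None:
        lam = adaptive_lambda(len(rewards))
    advantages = []
    gae = 0.0
    next_value = 0.0  # value after the terminal token
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages.append(gae)
        next_value = values[t]
    return advantages[::-1]
```

With `lam=0` each advantage collapses to the one-step TD error; as `lam` approaches 1 the estimate relies more on the observed (sparse) return and less on the value model, which is the behavior a length-adaptive schedule trades off per sequence.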
Why it matters?
This helps create AI tutors and assistants that can solve hard math and logic problems quickly without crashing, making them more useful for homework help and real-world tasks.
Abstract
We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked on the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of 60.4. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency. It reaches state-of-the-art performance within a mere 5,000 steps. Moreover, across multiple independent runs, no training crashes occur, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.