Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang
2025-09-03
Summary
This paper introduces a new method, called PACS, to help large language models (LLMs) get better at tasks that require reasoning, like solving math problems or writing code. It builds on a technique called Reinforcement Learning with Verifiable Rewards (RLVR), which trains LLMs by giving them feedback on whether their answers are correct.
What's the problem?
Training LLMs with RLVR can be tricky because the feedback is 'sparse': the model gets only a single correct-or-incorrect signal for an entire response, with no indication of which intermediate steps helped or hurt. On top of that, the usual methods for updating the LLM's 'policy' (how it makes decisions) rely on policy gradient updates that can be unstable, so the model may not learn consistently or efficiently. In short, the LLM struggles to make good use of the feedback it receives.
What's the solution?
PACS solves this by reframing the problem as a supervised learning task. Instead of directly trying to maximize a reward, the model learns a score, parameterized by the policy itself, that predicts whether an answer is correct, treating the correct/incorrect outcome as a label and training the prediction with a standard cross-entropy loss. This lets the model learn more stably and efficiently, and it implicitly combines the roles of the 'actor' (the LLM making decisions) and the 'critic' (the part evaluating those decisions) within a single learning process.
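To make the reformulation concrete, here is a minimal PyTorch sketch of the idea (a sketch under assumptions, not the authors' released implementation: the response's score is taken to be the masked sum of token log-probabilities under the policy, scaled by a hypothetical temperature beta, and the verifiable outcome reward in {0, 1} is used as the label of a sigmoid cross-entropy loss).

import torch
import torch.nn.functional as F


def pacs_style_loss(policy_logits: torch.Tensor,
                    response_ids: torch.Tensor,
                    response_mask: torch.Tensor,
                    outcome_reward: torch.Tensor,
                    beta: float = 1.0) -> torch.Tensor:
    """Cross-entropy loss over a score function parameterized by the policy.

    policy_logits:  (batch, seq_len, vocab) policy logits for the sampled responses.
    response_ids:   (batch, seq_len) token ids of the sampled responses.
    response_mask:  (batch, seq_len) 1.0 for response tokens, 0.0 for prompt/padding.
    outcome_reward: (batch,) verifiable outcome label, 1.0 = correct, 0.0 = incorrect.
    beta:           hypothetical temperature scaling the score (an assumption here).
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    token_logp = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    # Sequence-level score s_theta(x, y): masked sum of token log-probabilities.
    score = beta * (token_logp * response_mask).sum(dim=-1)
    # Treat the verifiable outcome as a label; sigmoid(score) predicts correctness.
    return F.binary_cross_entropy_with_logits(score, outcome_reward)


if __name__ == "__main__":
    # Toy batch: 2 sampled responses, 5 tokens each, vocabulary of 11.
    torch.manual_seed(0)
    logits = torch.randn(2, 5, 11, requires_grad=True)
    ids = torch.randint(0, 11, (2, 5))
    mask = torch.ones(2, 5)
    reward = torch.tensor([1.0, 0.0])  # one verified-correct, one incorrect response
    loss = pacs_style_loss(logits, ids, mask, reward)
    loss.backward()  # one supervised update step acting directly on the policy
    print(f"loss = {loss.item():.4f}")

Because the correctness prediction and the policy share the same parameters, minimizing this loss updates the policy directly, which is the sense in which the actor and critic end up coupled in one objective.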
Why it matters?
PACS shows significant improvements in mathematical reasoning compared to existing RLVR methods. For example, it reached 59.78% at pass@256 on a challenging math benchmark (AIME 2025), more than 13 points above PPO and GRPO. This matters because it offers a more reliable and effective way to improve LLMs after they've already been initially trained, making them better at complex reasoning tasks and opening up possibilities for more advanced AI applications.
Abstract
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address these challenges, we propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling via a Supervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling actor and critic roles, yielding more stable and efficient training. Benchmarking on challenging mathematical reasoning tasks, PACS outperforms strong RLVR baselines, such as PPO and GRPO, achieving superior reasoning performance. For instance, PACS achieves 59.78% at pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO, respectively. This simple yet powerful framework offers a promising avenue for LLM post-training with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
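As a rough illustration of the gradient claim above (a sketch under assumed notation, not the paper's exact derivation): write r ∈ {0, 1} for the verifiable outcome reward and s_θ(x, y) for the policy-parameterized score of response y to prompt x. The sigmoid cross-entropy loss and its gradient are

\[
\mathcal{L}(\theta) = -\Big[\, r \log \sigma\big(s_\theta(x,y)\big) + (1-r)\,\log\big(1-\sigma(s_\theta(x,y))\big) \Big],
\qquad
\nabla_\theta \mathcal{L}(\theta) = \big(\sigma(s_\theta(x,y)) - r\big)\,\nabla_\theta s_\theta(x,y).
\]

If s_θ is built from log π_θ(y | x), the descent direction takes the familiar policy-gradient form (r − σ(s_θ)) ∇_θ log π_θ(y | x), with σ(s_θ) acting as an implicit, critic-like baseline; this is the sense in which a single supervised objective couples the actor and the critic.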