
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou

2025-07-22


Summary

This paper introduces Archer, a new approach to Reinforcement Learning with Verifiable Rewards (RLVR) that improves how AI models handle reasoning and knowledge by treating different parts of their output differently during training.

What's the problem?

The problem is that previous RLVR methods treated every part of the AI's response the same way, without recognizing that some tokens carry important facts (knowledge tokens) while others drive the thinking and problem-solving (reasoning tokens). Applying one uniform training rule to both made the model less effective: it could not learn to reason better without also destabilizing the facts it already knew.
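The paper's entropy-aware framing (see the Abstract below) suggests one way such a split can be made: tokens the model is very confident about (low predictive entropy) tend to be knowledge tokens, while uncertain tokens (high entropy) tend to be reasoning tokens. The minimal sketch below illustrates that idea; the function name, quantile threshold, and exact classification rule are assumptions for illustration, not the paper's implementation.

```python
import torch

def reasoning_token_mask(logits: torch.Tensor, quantile: float = 0.8) -> torch.Tensor:
    """Classify generated tokens by predictive entropy.

    logits: (seq_len, vocab_size) policy logits at each generated position.
    Returns a boolean mask that is True where a token counts as a
    high-entropy "reasoning" token; the rest are treated as low-entropy
    "knowledge" tokens.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)
    cutoff = torch.quantile(entropy, quantile)            # illustrative threshold
    return entropy > cutoff
```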

What's the solution?

The authors designed Archer to apply different training rules to knowledge tokens and reasoning tokens, using dual-token constraints within a single synchronous update: tighter constraints keep factual (knowledge) tokens stable, while looser constraints let the model explore new ways of reasoning, as in the sketch below. This encourages the AI to keep its facts correct while searching for better reasoning strategies, leading to stronger performance on math and coding tasks.
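To make "different training rules per token type" concrete, here is a minimal sketch of what such a dual-token constraint could look like in a PPO-style objective, continuing the toy mask above: a looser clip range and weaker KL penalty on reasoning tokens (to promote exploration), a tighter clip range and stronger KL penalty on knowledge tokens (to keep facts stable). All names and coefficient values are illustrative assumptions, not the paper's exact objective.

```python
import torch

def dual_token_loss(logp_new, logp_old, advantages, reasoning_mask,
                    clip_reasoning=0.3, clip_knowledge=0.2,
                    kl_reasoning=0.0, kl_knowledge=0.05):
    """PPO-style clipped loss with per-token-type constraints.

    All tensors have shape (seq_len,). `reasoning_mask` is True on
    high-entropy reasoning tokens, False on knowledge tokens.
    """
    ratio = torch.exp(logp_new - logp_old)
    # Looser clip range on reasoning tokens, tighter on knowledge tokens.
    eps = torch.where(reasoning_mask,
                      torch.tensor(clip_reasoning),
                      torch.tensor(clip_knowledge))
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_loss = -torch.minimum(ratio * advantages, clipped * advantages)

    # Approximate per-token KL to the old policy (k3-style estimator),
    # weighted more heavily on knowledge tokens to keep them stable.
    log_r = logp_old - logp_new
    kl = torch.exp(log_r) - log_r - 1.0
    kl_coef = torch.where(reasoning_mask,
                          torch.tensor(kl_reasoning),
                          torch.tensor(kl_knowledge))
    return (policy_loss + kl_coef * kl).mean()
```

The point of the sketch is only that the clip range and KL weight vary by token type within the same synchronous gradient update, rather than being applied in separate training phases.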

Why it matters?

This matters because it makes AI models smarter and more accurate by stabilizing what they already know while encouraging them to reason more effectively, pushing forward the capabilities of language models on complex reasoning tasks.

Abstract

Archer, an entropy-aware RLVR approach with dual-token constraints and synchronous updates, enhances LLM reasoning abilities by differentiating between knowledge and reasoning tokens, achieving state-of-the-art performance on mathematical reasoning and code generation benchmarks.