From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks
Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang
2025-12-08
Summary
This paper explores how to improve large language models after their initial training, focusing on making them better at reasoning tasks. It introduces a new method called CAPO that helps these models learn more effectively from reinforcement-learning feedback.
What's the problem?
When you try to improve a language model with feedback, that feedback can be both positive (the model did well) and negative (the model did poorly). Existing methods mix these signals together from the very start of training, which can confuse the model early in the learning process and limit how much it actually improves. It's like trying to learn something new while someone praises and criticizes you at the same time – it's hard to know what to focus on.
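To make "positive and negative signals" concrete, here is a minimal sketch (not from the paper) of the group-relative advantage used in GRPO-style methods: each sampled answer's advantage is its reward minus the average reward of its group, so above-average answers get positive signals and below-average ones get negative signals. The function name and reward values are illustrative assumptions.

```python
def group_advantages(rewards):
    """Advantage of each sampled answer = its reward minus the group mean.

    Positive values reward the answer; negative values penalize it.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four sampled answers to one prompt, scored 1.0 (correct) or 0.0 (wrong):
adv = group_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # correct answers get +0.5, wrong ones -0.5
```

Both signal types arrive in the same batch, which is exactly the mixing the paper argues is harmful early in training.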
What's the solution?
The researchers developed CAPO, which stands for Curriculum Advantage Policy Optimization. This method doesn't throw all the feedback at the model at once. Instead, it first lets the model learn from *only* the positive feedback, building a strong foundation. Then, it gradually introduces the negative feedback to help the model learn to distinguish between good and bad answers. This staged approach, or 'curriculum,' helps the model learn more effectively and generalize better to different situations.
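The staged idea can be sketched as a mask over advantage values: in an early imitation phase only positive-advantage samples contribute to the update, and later both signs are kept. This is a simplified illustration under assumptions of mine; the paper describes an *adaptive* curriculum driven by the advantage signals themselves, whereas this sketch uses a fixed warmup-step threshold, and all names here are hypothetical.

```python
def curriculum_mask(advantages, step, warmup_steps=100):
    """Apply a two-phase curriculum to per-sample advantages.

    Phase 1 (step < warmup_steps): imitation – zero out negative advantages,
    so the model learns only from better-than-expected samples.
    Phase 2: discrimination – keep all signals, positive and negative.
    """
    if step < warmup_steps:
        return [a if a > 0 else 0.0 for a in advantages]
    return list(advantages)

# Early in training, negative signals are suppressed:
print(curriculum_mask([0.5, -0.5], step=10))   # only the positive signal remains
# Later, both signals shape the policy:
print(curriculum_mask([0.5, -0.5], step=500))  # both signals kept
```

The same mask slots in front of any optimizer that consumes per-sample advantages, which is consistent with the paper's claim of compatibility with GRPO, PPO, RLOO, and Reinforce++.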
Why it matters?
This research is important because it provides a more reliable and effective way to improve large language models. CAPO works well with different learning techniques and has shown improvements in both mathematical reasoning and understanding visual interfaces like GUIs. This means it's a versatile tool that can help make AI systems more capable and useful in a wider range of applications.
Abstract
Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose **CAPO** (**C**urriculum **A**dvantage **P**olicy **O**ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.