Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models

Hoang Phan, Xianjun Yang, Kevin Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei

2025-10-29

Summary

This paper focuses on a technique called Reinforcement Learning with Verifiable Rewards (RLVR), which is used to improve the reasoning abilities of AI models that understand both text and images. The research addresses a key issue with this technique: during prolonged training, models can 'forget' foundational skills they already had while learning new, more complex ones.

What's the problem?

When you train AI models on complex tasks with RLVR, they often get worse at things they already did well, like correctly identifying objects in images or staying faithful to the input information. Adding a regularization rule that discourages large deviations from the base model doesn't fully solve this, because that rule is computed only on the current task's data, so it says nothing about the model's broader knowledge. On top of that, when the model learns from a mix of different data sources, it's hard to decide how much training focus each skill should receive.
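The "rule that discourages large deviations" is typically a KL-divergence penalty subtracted from the task reward. The sketch below (illustrative only; function names and the `kl_coef` value are our own, not from the paper) shows why such a penalty is blind to broader knowledge: it is computed only on the output distributions the current task actually produces.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete token distributions.
    Assumes p and q are probability vectors over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(task_reward, policy_probs, base_probs, kl_coef=0.1):
    """Task reward minus a KL penalty toward the base model.

    The penalty only sees the distributions sampled on the current task,
    so skills the task never exercises (e.g., perception) are unconstrained
    and can still drift."""
    return task_reward - kl_coef * kl_divergence(policy_probs, base_probs)
```

If the policy matches the base model on the current task's outputs, the penalty vanishes, yet the model may still have changed arbitrarily on inputs the task never covers.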

What's the solution?

The researchers developed a new method called RECAP, which stands for REplay with dynamic objective reweighting for general knowledge preservation. RECAP continuously monitors how well the model is doing on each training objective and adjusts the training focus accordingly: if an objective has saturated (the model has effectively mastered it), RECAP spends less effort on it, while objectives that are underperforming or unstable receive more attention. This happens automatically during training, without modifying the core RLVR process or training any additional models.
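The reweighting idea can be sketched as follows. This is a toy illustration of the general mechanism, not the authors' exact algorithm: the class name, the window size, and the particular priority score (low mean reward plus volatility) are all our assumptions. Each objective keeps a short history of recent rewards; saturated objectives (high, stable reward) get a low weight, while underperforming or volatile ones are upweighted.

```python
from collections import deque
import statistics

class ObjectiveReweighter:
    """Toy sketch of dynamic objective reweighting (hypothetical names and
    scoring; not the paper's exact method). Tracks a short-horizon reward
    history per objective and turns it into normalized training weights."""

    def __init__(self, objectives, window=8):
        # Short-horizon history: only the last `window` batch-mean rewards.
        self.history = {name: deque(maxlen=window) for name in objectives}

    def update(self, rewards):
        """rewards: dict mapping objective name -> latest batch-mean reward in [0, 1]."""
        for name, r in rewards.items():
            self.history[name].append(r)

    def weights(self):
        scores = {}
        for name, hist in self.history.items():
            if len(hist) < 2:
                scores[name] = 1.0  # not enough signal yet; neutral priority
                continue
            mean = statistics.fmean(hist)
            volatility = statistics.pstdev(hist)
            # Low mean reward (underperforming) and high volatility (unstable)
            # both raise priority; a saturated objective (mean near 1, low
            # volatility) scores low and is downweighted.
            scores[name] = (1.0 - mean) + volatility
        total = sum(scores.values()) or 1.0
        return {name: s / total for name, s in scores.items()}
```

In use, the trainer would call `update` with each batch's per-objective rewards and scale each objective's loss (or its share of the replay mix) by the returned weights, so focus drifts away from mastered skills toward struggling ones.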

Why it matters?

This research is important because it allows us to improve AI models' reasoning abilities without sacrificing their existing knowledge. By preventing 'forgetting,' we can build more reliable and versatile AI systems that are good at a wider range of tasks, and it makes it easier to balance learning new skills with maintaining existing ones.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, where models forget foundational skills after prolonged training without employing regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are calculated on the current task, thus they do not guarantee broader knowledge. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training focus each objective should receive. To address this, we propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts in an online manner using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments on benchmarks based on Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.