Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning
Tingyu Li, Zheng Sun, Jingxuan Wei, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan
2025-12-09
Summary
This paper addresses the challenge of teaching large vision-language models (VLMs) to reason and self-improve over time using reinforcement learning.
What's the problem?
Training VLMs with reinforcement learning requires large amounts of high-quality data, which is hard to obtain, especially in specialized fields like chemistry or math. Current methods try to generate synthetic data or have the model reward itself, but both often lead to the model finding loopholes that earn high scores without actually solving problems correctly. This is called 'reward hacking', and it makes training unstable.
What's the solution?
The researchers developed a system called DoGe, which stands for 'Decouple to Generalize'. DoGe breaks down the learning process into two parts: a 'Thinker' that focuses on understanding the context of a problem, and a 'Solver' that actually tries to solve it. This helps the model learn to understand *what* a problem is asking before trying to solve it. They also created a system that continuously adds new, relevant problems to the training data, making it more diverse and challenging.
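The Thinker/Solver split described above can be illustrated with a minimal sketch. All names here (`thinker`, `solver`, `context_reward`, `answer_reward`, `two_stage_step`) are hypothetical stand-ins, not the authors' API; real components would be VLM rollouts, and the rewards would feed an RL update. The sketch only shows the two-stage reward structure: stage 1 rewards context understanding, stage 2 rewards answer correctness.

```python
# Illustrative sketch of a Thinker/Solver decoupling with two-stage rewards.
# All function names and the reward definitions are assumptions for
# illustration, not the paper's implementation.

def thinker(problem: str) -> str:
    """Stub 'Thinker': extracts a context summary from the problem.
    A real system would query a VLM; here we keep the text up to the question."""
    return problem.split("?")[0].strip()

def solver(context: str, problem: str) -> str:
    """Stub 'Solver': answers using the Thinker's context.
    Here it trivially handles one toy arithmetic question."""
    if "2 + 2" in problem:
        return "4"
    return "unknown"

def context_reward(context: str, reference_keywords: list[str]) -> float:
    """Stage-1 reward: fraction of reference context keywords recovered."""
    hits = sum(1 for k in reference_keywords if k in context)
    return hits / len(reference_keywords)

def answer_reward(answer: str, gold: str) -> float:
    """Stage-2 reward: exact-match correctness."""
    return 1.0 if answer == gold else 0.0

def two_stage_step(problem: str, keywords: list[str], gold: str, stage: int) -> float:
    """One rollout's scalar reward under the two-stage schedule."""
    ctx = thinker(problem)
    if stage == 1:  # stage 1: free exploration of the problem context
        return context_reward(ctx, keywords)
    return answer_reward(solver(ctx, problem), gold)  # stage 2: task solving

if __name__ == "__main__":
    p = "A chemistry diagram shows a reaction. What is 2 + 2?"
    print(two_stage_step(p, ["chemistry", "reaction"], "4", stage=1))  # 1.0
    print(two_stage_step(p, ["chemistry", "reaction"], "4", stage=2))  # 1.0
```

The design point this mirrors is that the two stages score different behaviors: a model cannot hack the stage-1 reward by guessing answers, because stage 1 never looks at the answer at all.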
Why it matters?
This work is important because it provides a way to build VLMs that can continuously learn and improve on their own, even in areas where getting real-world data is difficult. It offers a more stable and effective approach to reinforcement learning for these models, paving the way for more powerful and adaptable AI systems.
Abstract
Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL), which offers a feasible path toward continuously self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, which is especially challenging to obtain in specialized domains such as chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to first learn from context rather than problem solving, refocusing on the problem-context scenarios overlooked by synthetic-data methods. By decoupling the learning process into dual components (Thinker and Solver), we quantify reward signals for this process and propose a two-stage RL post-training approach that moves from freely exploring context to practically solving tasks. Second, to increase the diversity of training data, DoGe constructs an evolving curriculum-learning pipeline: an expanded native domain-knowledge corpus and an iteratively evolving seed-problem pool. Experiments show that our method consistently outperforms the baseline across various benchmarks, providing a scalable pathway toward self-evolving LVLMs.
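The iteratively evolving seed-problem pool mentioned in the abstract can be sketched as a simple filtering loop. This is an illustrative sketch, not the authors' code: `evolve_pool`, `rate_fn`, and the band thresholds are assumptions. The idea shown is that newly generated candidates join the pool only if the current model solves them sometimes but not always, keeping the curriculum challenging without being impossible.

```python
# Illustrative sketch (not the paper's implementation) of an evolving
# seed-problem pool: candidates are kept only if their measured solve
# rate falls in a target "frontier" band.

def evolve_pool(pool: list[str], candidates: list[str],
                rate_fn, lo: float = 0.2, hi: float = 0.8) -> list[str]:
    """Append candidates whose solve rate lies in [lo, hi].
    rate_fn maps a problem to the model's empirical solve rate in [0, 1],
    e.g. estimated from several policy rollouts per problem."""
    return pool + [p for p in candidates if lo <= rate_fn(p) <= hi]

if __name__ == "__main__":
    # Stub solve rates standing in for rollouts of the current policy.
    rates = {"easy": 0.95, "frontier": 0.5, "too_hard": 0.05}
    pool = evolve_pool(["seed_1"], list(rates), rates.get)
    print(pool)  # ['seed_1', 'frontier']
```

Re-running this filter after each training round naturally shifts the pool as the model improves: problems that become too easy fall out of the band, and previously too-hard problems enter it.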