Coupled Variational Reinforcement Learning for Language Model General Reasoning
Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang
2025-12-19
Summary
This paper focuses on improving how language models (like those powering chatbots) learn to reason through trial and error, a process called reinforcement learning.
What's the problem?
Currently, teaching these models to reason using reinforcement learning requires a way to clearly tell them when they've given a good answer – a 'reward'. Newer methods try to avoid needing these external rewards by using the model's own confidence in its answers as a guide. However, these methods often generate potential reasoning steps without considering the final answer, which can lead to wasted effort and reasoning that doesn't actually make sense in relation to the solution.
What's the solution?
The researchers introduce a new technique called Coupled Variational Reinforcement Learning (CoVRL). This method cleverly combines two approaches: one that explores many possible reasoning paths and one that focuses on paths likely to lead to correct answers. By linking these two processes together, CoVRL makes the model explore more efficiently and ensures the reasoning steps are closely tied to the final answer. It's like having a brainstorming session that's guided by a clear understanding of the goal.
Why it matters?
This work is important because it provides a more effective way to improve the reasoning abilities of language models without relying on external feedback. The experiments show CoVRL significantly boosts performance on challenging reasoning tasks, suggesting it's a promising step towards building more intelligent and reliable AI systems.
Abstract
While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by using the intrinsic probability that the LLM assigns to the reference answer as the reward signal. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling the prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
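The hybrid sampling idea in the abstract can be sketched in code. The snippet below is an illustrative toy, not the paper's implementation: all function names and the mixing scheme are assumptions. It mixes reasoning traces drawn from a "prior" policy (conditioned on the question alone) with traces from a "posterior" policy (also conditioned on the reference answer), and scores each trace with a verifier-free reward, here stubbed as the log-probability of reproducing the reference answer.

```python
import math
import random

def hybrid_sample(question, answer, sample_prior, sample_posterior,
                  answer_logprob, n=8, mix=0.5):
    """Draw n reasoning traces from a mixture of a question-only prior
    sampler and an answer-conditioned posterior sampler, attaching a
    verifier-free reward (log-prob of the reference answer) to each.
    This is a hypothetical sketch of CoVRL-style coupled sampling."""
    batch = []
    for _ in range(n):
        if random.random() < mix:
            trace = sample_posterior(question, answer)  # sees the answer
        else:
            trace = sample_prior(question)              # question only
        reward = answer_logprob(question, trace, answer)
        batch.append((trace, reward))
    return batch

# Toy stand-ins so the sketch runs end-to-end; a real system would call
# an LLM for sampling and for scoring the reference answer.
def toy_prior(q):
    return f"think({q})"

def toy_posterior(q, a):
    return f"think({q}|{a})"

def toy_logprob(q, trace, a):
    # Pretend answer-conditioned traces make the reference answer likelier.
    return math.log(0.9 if "|" in trace else 0.4)

random.seed(0)
batch = hybrid_sample("2+2?", "4", toy_prior, toy_posterior, toy_logprob)
print(len(batch))  # 8 (trace, reward) pairs
```

Under this reading, the mixture weight `mix` is what couples exploration (prior samples) with thought-answer coherence (posterior samples); the paper's actual composite distribution and its optimization are more involved than this stub.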