CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models

Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, Dong Yu

2025-09-11

Summary

This paper focuses on improving how well large language models (LLMs) can reason by using a technique called Reinforcement Learning with Verifiable Rewards, or RLVR. The core idea is to train the LLM to get 'rewards' for giving correct answers, but the paper finds that current methods struggle with exploring different possibilities and often get stuck giving the same, potentially flawed, answers.

What's the problem?

When training LLMs with RLVR, a major issue is that the model doesn't explore enough different ways to solve a problem. It quickly settles on a solution, even if that solution isn't the best or most accurate. This happens because the model becomes overconfident in its initial answers and stops trying new things, leading to a loss of diversity in its responses and, ultimately, poor performance. Essentially, the model gets stuck in a rut and doesn't learn to think critically.

What's the solution?

The researchers introduced a new framework called Curiosity-Driven Exploration, or CDE, to encourage the LLM to be more adventurous. They do this by giving the model an extra 'bonus' for exploring answers that it's uncertain about. This bonus comes from two sources: how surprised the model (the actor) is by its own generated answer, measured by its perplexity, and how much the multiple heads of the value estimator (the critic) disagree when scoring an answer, measured by their variance. By rewarding curiosity, the model is pushed to consider a wider range of possibilities and avoid becoming overconfident.
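To make the two bonus signals concrete, here is a minimal illustrative sketch in plain Python. This is not the paper's implementation; the function names, inputs (`token_logprobs`, `value_heads`), and the coefficients `alpha` and `beta` are assumptions chosen for illustration.

```python
import math

def actor_curiosity_bonus(token_logprobs):
    """Perplexity of the generated response: exp of the mean negative
    token log-probability. Higher perplexity means the actor is more
    'surprised' by its own answer, so exploration is rewarded more."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def critic_curiosity_bonus(value_heads):
    """Disagreement among multiple value heads: the variance of their
    estimates. High variance means the critic is uncertain about the
    value of this response."""
    mean = sum(value_heads) / len(value_heads)
    return sum((v - mean) ** 2 for v in value_heads) / len(value_heads)

def shaped_reward(verifiable_reward, token_logprobs, value_heads,
                  alpha=0.1, beta=0.1):
    """Verifiable reward plus weighted curiosity bonuses.
    alpha and beta are illustrative weighting coefficients."""
    return (verifiable_reward
            + alpha * actor_curiosity_bonus(token_logprobs)
            + beta * critic_curiosity_bonus(value_heads))
```

A response the model is certain about (log-probabilities near 0, so perplexity near 1) and the critic agrees on (variance 0) gets almost no bonus, while uncertain responses get a larger shaped reward, which is the intended exploration pressure.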

Why it matters?

This work is important because it addresses a key limitation of using reinforcement learning to improve LLMs. By making the models more curious and encouraging exploration, we can unlock their full potential for reasoning and problem-solving. The research also helps us understand *why* LLMs sometimes fail, identifying a 'calibration collapse' where the model's confidence doesn't match its accuracy, which is a common issue with these powerful AI systems.
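"Calibration collapse" means the model's stated confidence stops tracking its actual accuracy. One common way to quantify such a mismatch (a standard metric, not necessarily the one used in the paper) is expected calibration error, sketched below.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then compare each bin's average
    confidence to its empirical accuracy; ECE is the size-weighted
    average gap. A well-calibrated model has ECE near 0."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece, n = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

An overconfident model that answers with 95% confidence but is wrong every time would score an ECE of 0.95, which is the kind of confidence/accuracy gap the paper's analysis points to.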

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.