Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning

Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, Dong Yu

2025-12-18

Summary

This paper introduces a new method, called G2RL, to improve how large language models learn through a process called reinforcement learning. It focuses on making the exploration phase – where the model tries out different approaches – more effective and aligned with how the model actually updates its knowledge.

What's the problem?

Current techniques for encouraging exploration in reinforcement learning for large language models, such as entropy bonuses (adding randomness) or comparing responses against external embedding models, don't ensure the model truly learns something new. They only vary the surface form of the responses, with no guarantee that the sampled responses differ in the directions that actually drive the model's parameter updates. In other words, they don't necessarily push the model to learn in directions that will make it better at solving problems.

What's the solution?

G2RL works by looking at how each sampled response would update the model itself (its 'policy'). For every response, it computes a cheap 'feature' from the model's final layer that approximates the direction in which that response would shift the model's parameters. Responses whose update directions are novel within the sampled group receive a bounded boost to their reward, while responses that lead to redundant or unhelpful updates are de-emphasized. This creates a self-regulating exploration signal that fits naturally with existing reinforcement learning techniques like PPO.
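To make the group-wise comparison concrete, here is a minimal sketch of the reward-scaling idea: given one feature vector per sampled response, responses whose features point in novel directions (low cosine similarity to the rest of the group) get their reward multiplied by a bounded factor above 1, while redundant ones are scaled down. The feature construction, the `alpha` knob, and the exact novelty formula are illustrative assumptions, not the paper's precise definitions.

```python
import numpy as np

def novelty_reward_scalars(features, base_rewards, alpha=0.5):
    """Scale each response's reward by how novel its (approximate)
    update direction is within the sampled group.

    features:     (G, d) array, one feature vector per response
                  (a stand-in for G2RL's final-layer features).
    base_rewards: (G,) task rewards, e.g. from a verifier.
    alpha:        strength of the bounded multiplicative scaler
                  (an assumed hyperparameter).
    """
    # Unit-normalize so dot products reduce to cosine similarity.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                          # (G, G) pairwise cosine similarity
    G = len(f)
    # Redundancy: mean similarity to the *other* responses in the group.
    redundancy = (sim.sum(axis=1) - 1.0) / (G - 1)
    # Novelty in [0, 1]: orthogonal or opposing directions score higher.
    novelty = (1.0 - redundancy) / 2.0
    # Bounded multiplicative scaler in [1 - alpha/2, 1 + alpha/2].
    scaler = 1.0 + alpha * (novelty - 0.5)
    return base_rewards * scaler
```

With three responses where two share the same direction and one is orthogonal, the orthogonal response ends up with the highest scaled reward, which is the qualitative behavior the method aims for. The bound on the scaler keeps the signal compatible with PPO-style clipping and KL control.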

Why it matters?

This research is important because it provides a more effective way to train large language models to reason better. By guiding exploration based on the model's own internal learning process, G2RL consistently improves performance on challenging reasoning tasks, like math problems and general knowledge questions. It shows that understanding how a model updates itself is key to helping it learn more effectively.

Abstract

Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model's own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaler, while redundant or off-manifold updates are de-emphasized, yielding a self-referential exploration signal that is naturally aligned with PPO-style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLU-Pro) on Qwen3-Base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy's own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
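The abstract notes that the sequence-level feature comes from final-layer sensitivity and costs essentially nothing beyond the forward pass. One plausible way to read this is as a sensitivity-weighted pool of final hidden states: for a softmax output head, the gradient of the log-probability with respect to the logits at step t is (onehot - p_t), whose weight on the sampled token is (1 - p_t). The sketch below uses that quantity to weight each hidden state; this construction is our illustrative assumption, not the paper's exact definition.

```python
import numpy as np

def sequence_feature(hidden_states, token_probs):
    """Cheap sequence-level gradient feature from one forward pass.

    hidden_states: (T, d) final-layer hidden states for the sampled tokens.
    token_probs:   (T,) model probability of each sampled token.

    Each hidden state is weighted by (1 - p_t), the softmax sensitivity
    at the sampled token, giving a rough proxy for the first-order
    update direction the response would induce (assumed construction).
    """
    weights = 1.0 - token_probs                       # per-token sensitivity
    feat = (weights[:, None] * hidden_states).sum(axis=0)
    return feat / (np.linalg.norm(feat) + 1e-8)       # unit-norm feature
```

Confident (high-probability) tokens contribute little to the feature, while surprising tokens dominate it, so two responses that differ only in already-certain tokens map to similar features, which is consistent with the paper's aim of distinguishing trajectories by their update directions rather than their surface form.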