
Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, Wenhu Chen

2025-09-10


Summary

This paper investigates why training large language models (LLMs) with reinforcement learning makes them better at complex reasoning, and how that improvement actually comes about during training.

What's the problem?

While we know reinforcement learning boosts LLM reasoning, it's been unclear what's actually going on under the hood. The paper points to puzzling behaviors such as sudden jumps in capability ('aha moments'), performance improving as the model's answers grow longer ('length-scaling'), and shifts in how predictable the model's outputs are (entropy dynamics). These seemed like separate issues, but the researchers suspected they were all connected to how the model balances thinking strategically against getting the small details right.

What's the solution?

The researchers found that LLMs trained with reinforcement learning go through two phases. First, the model focuses on getting the basic procedural steps correct – like learning the rules of a game. Then, the biggest gains come when it starts exploring and mastering high-level strategies. However, current reinforcement learning methods such as GRPO treat every token of the answer the same, wasting optimization effort on steps the model has already mastered. To fix this, the authors propose HICRA, a new algorithm that concentrates the learning signal on the tokens involved in strategic planning, making training more efficient (a rough sketch of the idea follows below). They also show that 'semantic entropy' – how much the meaning varies across sampled answers – is a better way to track strategic exploration than token-level entropy, which only reflects variety in individual word choices.
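To make the credit-assignment idea concrete, here is a minimal sketch of what "concentrating the learning signal on planning tokens" could look like. It is an illustration under our own assumptions, not the paper's implementation: the function name, the `planning_mask`, and the `alpha` weighting are all hypothetical, and how planning tokens are identified is left to the caller.

```python
import torch

def hierarchy_aware_advantages(advantages: torch.Tensor,
                               planning_mask: torch.Tensor,
                               alpha: float = 1.0) -> torch.Tensor:
    """Reweight per-token advantages so optimization pressure concentrates
    on strategic "planning" tokens instead of being spread uniformly.

    advantages:    per-token advantages from a GRPO-style estimator, shape (T,)
    planning_mask: 1.0 for tokens judged to be high-level planning tokens
                   (e.g. "first", "instead", "let's verify"), 0.0 otherwise
    alpha:         how strongly to amplify the signal on planning tokens
    (All names and the exact weighting scheme are assumptions for illustration.)
    """
    return advantages * (1.0 + alpha * planning_mask)

# Toy usage: three execution tokens and one planning token.
adv = torch.tensor([0.2, 0.2, 0.2, 0.2])
mask = torch.tensor([0.0, 0.0, 1.0, 0.0])
print(hierarchy_aware_advantages(adv, mask, alpha=2.0))  # planning token gets 3x the weight
```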

Why it matters?

This work is important because it gives us a better understanding of how LLMs learn to reason. By identifying the shift from procedural learning to strategic planning, and by developing HICRA, we can build more effective reinforcement learning algorithms that unlock even more advanced reasoning abilities in these powerful models. It also provides a better way to measure progress in strategic thinking, which is crucial for building truly intelligent AI.

Abstract

Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet the underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling" and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose HIerarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. HICRA significantly outperforms strong baselines, demonstrating that focusing on this strategic bottleneck is key to unlocking advanced reasoning. Furthermore, we validate semantic entropy as a superior compass for measuring strategic exploration over misleading metrics such as token-level entropy.
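As a rough illustration of the semantic-entropy idea from the abstract: sample several answers to the same prompt, group them by meaning rather than by surface wording, and measure the entropy of that cluster distribution. The sketch below is a toy under our own assumptions – the `meaning_key` function stands in for a real semantic-equivalence check – and is not the paper's procedure.

```python
import math
from collections import Counter

def semantic_entropy(sampled_answers, meaning_key):
    """Entropy over meaning clusters of sampled answers.

    meaning_key maps an answer to a label for its meaning; here it is a
    stand-in for a real semantic-equivalence check (an assumption made for
    illustration). Low semantic entropy means the answers collapse onto one
    meaning even if their wording (token-level entropy) still varies.
    """
    clusters = Counter(meaning_key(a) for a in sampled_answers)
    total = sum(clusters.values())
    return -sum((n / total) * math.log(n / total) for n in clusters.values())

# Toy usage: two phrasings of the same answer share a cluster, so only the
# genuinely different answer adds semantic entropy.
answers = ["x = 4", "the answer is 4", "x equals 7"]
print(semantic_entropy(answers, meaning_key=lambda a: "4" if "4" in a else "7"))
```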