
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu

2025-12-24

Summary

This paper investigates how large language models (LLMs) make decisions internally, moving beyond treating them as single 'black box' systems. It breaks down the LLM's decision-making process into contributions from individual layers and parts within those layers, and then uses this understanding to improve how these models are trained.

What's the problem?

Current methods for training LLMs using reinforcement learning don't consider the complex internal workings of the model. LLMs aren't just one thing making a decision; different layers and components within them seem to handle different parts of the reasoning process. Without understanding how these internal 'policies' develop, it's hard to effectively target improvements or understand *why* a model makes a certain decision.

What's the solution?

The researchers discovered that they could separate the LLM's decision-making into 'Internal Layer Policies' – what each layer contributes – and 'Internal Modular Policies' – what the self-attention and feed-forward parts within each layer do. They found that earlier layers explore many possibilities (high entropy), while later layers refine the answer (low entropy). Based on this, they created a new training method called 'Bottom-up Policy Optimization' (BuPO) which focuses on improving the early layers of the model, essentially building a stronger foundation for reasoning. This method directly optimizes the internal layer policy during early training.
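To make this concrete, below is a minimal, hedged sketch of how one could read out an "internal layer policy": project each layer's hidden state through the model's unembedding matrix and measure the entropy of the resulting next-token distribution. The model name, prompt, and the handling of the final normalization layer are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: per-layer "internal policies" via the unembedding matrix.
# Assumes a Hugging Face causal LM; the model name and prompt are arbitrary choices.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # illustrative; any decoder-only LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings().weight  # [vocab_size, hidden_size]

# out.hidden_states holds the embedding output plus one tensor per layer.
for layer_idx, h in enumerate(out.hidden_states):
    # Note: for a closer match, the model's final norm could be applied to h first
    # (attribute name varies by architecture, e.g. model.model.norm for Llama/Qwen).
    last_token = h[0, -1]                     # hidden state at the last position
    logits = last_token @ unembed.T           # project into vocabulary space
    probs = F.softmax(logits.float(), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    print(f"layer {layer_idx:2d}: entropy = {entropy.item():.3f}")
```

If the paper's findings hold for the chosen model, the printed entropies should stay high in the early layers and drop toward zero near the top of the stack.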

Why it matters?

This work is important because it provides a way to peek inside LLMs and understand *how* they think. By optimizing the foundational reasoning abilities of the model, BuPO leads to better performance on complex tasks. It also shows that different LLM architectures (like Llama and Qwen) develop their reasoning processes in different ways, offering insights into how to design even better models in the future.

Abstract

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of internal policies, we find that: (a) early layers maintain high entropy for exploration, while top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series; (b) Llama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning the training objective at lower layers, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.
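As a rough illustration of what "directly optimizing the internal layer policy" could mean in an RL objective, here is a conceptual sketch: a REINFORCE-style surrogate loss evaluated on the layer-k policy obtained through the unembedding matrix, rather than on the final-layer policy. The function names, the choice of surrogate, and the advantage inputs are assumptions for illustration, not the authors' BuPO implementation (see the repository above for that).

```python
# Conceptual sketch only (not the authors' BuPO implementation): a simple
# policy-gradient surrogate evaluated on a lower layer's internal policy.
import torch.nn.functional as F

def internal_layer_log_probs(model, input_ids, layer_k):
    """Log-probabilities of the next tokens under the layer-k internal policy,
    i.e. layer-k hidden states projected through the unembedding matrix."""
    out = model(input_ids, output_hidden_states=True)
    h_k = out.hidden_states[layer_k]                  # [batch, seq, hidden]
    unembed = model.get_output_embeddings().weight    # [vocab, hidden]
    logits_k = h_k @ unembed.T                        # layer-k "policy" logits
    log_probs = F.log_softmax(logits_k[:, :-1].float(), dim=-1)
    targets = input_ids[:, 1:]                        # next-token targets
    return log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

def bottom_up_pg_loss(model, input_ids, advantages, layer_k):
    """REINFORCE-style surrogate: push the layer-k internal policy toward
    tokens with positive advantages. `advantages` has shape [batch, seq-1]."""
    logp = internal_layer_log_probs(model, input_ids, layer_k)
    return -(advantages * logp).mean()
```

The point of the sketch is only that the gradient signal is attached to a lower layer's read-out rather than to the model's final output distribution, which is the intuition behind "bottom-up" optimization described above.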