Hierarchical Budget Policy Optimization for Adaptive Reasoning
Shangke Lyu, Linjuan Wu, Yuchen Yan, Xingyu Wu, Hao Li, Yongliang Shen, Peisheng Jiang, Weiming Lu, Jun Xiao, Yueting Zhuang
2025-07-25
Summary
This paper introduces Hierarchical Budget Policy Optimization (HBPO), a method that teaches AI models to decide how much reasoning effort to spend on a problem, based on how hard it is, using a structured system of token budgets.
What's the problem?
Reasoning models typically spend a similar amount of "thinking" on every problem. This wastes compute on simple questions and can leave hard ones under-explored, making the models both inefficient and less accurate than they could be.
What's the solution?
The researchers designed HBPO to partition training into a hierarchy of budget levels, each capping how many tokens the model may spend reasoning. Budget-aware rewards that trade off efficiency against accuracy then teach the model to select an appropriate reasoning length for each problem.
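The idea of a budget-aware reward can be sketched in a few lines. This is a minimal illustration, not the paper's actual reward: the function name, the penalty form (a linear discount for overrunning the budget), the budget values, and the `efficiency_weight` parameter are all assumptions made for this example.

```python
# Hypothetical budget levels (token caps) that partition exploration
# during training; the paper's actual levels may differ.
BUDGETS = [512, 1024, 2048, 4096]

def budget_reward(correct: bool, tokens_used: int, budget: int,
                  efficiency_weight: float = 0.5) -> float:
    """Toy budget-aware reward: full credit for a correct answer,
    discounted in proportion to how far the trace overruns its budget."""
    accuracy = 1.0 if correct else 0.0
    overrun = max(0, tokens_used - budget) / budget
    return accuracy - efficiency_weight * overrun

# A correct answer within budget keeps full reward;
# a correct but long-winded answer is penalized,
# nudging the model toward the shortest budget that still succeeds.
print(budget_reward(True, 400, 512))    # within budget
print(budget_reward(True, 1024, 512))   # 2x over budget
```

Under a reward like this, the model is incentivized to answer easy problems under a small budget and reserve long reasoning traces for problems where the accuracy gain outweighs the overrun penalty.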
Why does it matter?
This matters because HBPO helps AI models become faster and smarter by adapting their thinking effort, saving computing resources while improving their ability to solve problems correctly.
Abstract
Hierarchical Budget Policy Optimization (HBPO) is a reinforcement learning framework that optimizes reasoning depth for large models, improving efficiency and accuracy by adapting to problem complexity.