
Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding

Ziheng Li, Zexu Sun, Jinman Zhao, Erxue Min, Yongcheng Zeng, Hui Wu, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen, Zhi-Hong Deng

2025-09-10


Summary

This paper focuses on improving how large language models (LLMs) learn to solve complex problems, such as math reasoning, using reinforcement learning with verifiable rewards (RLVR). The core idea is to make learning more efficient by keeping the difficulty of each training problem matched to what the model can currently handle.

What's the problem?

When LLMs are trained with reinforcement learning, much of the effort is wasted on problems that are either too hard or too easy. If a problem is too difficult, the model rarely finds a correct solution path and receives almost no learning signal. If it is too easy, the model succeeds without stretching its abilities and barely improves. Existing methods do not adjust problem difficulty to the model's current skill level, which makes exploration inefficient and learning slow.
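One way to see why both extremes waste effort: RLVR methods such as GRPO score each rollout against the others in its group, so if every rollout fails (or every rollout succeeds) there is nothing to compare and the gradient signal vanishes. The snippet below is a minimal illustration of that effect, not code from the paper; the function name and group size are my own.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: each rollout's reward normalized against its group."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std == 0:                       # all rollouts agree -> no preference to learn from
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# 8 rollouts per problem, reward 1 for a correct final answer, 0 otherwise.
print(group_relative_advantages([0] * 8))            # too hard: all-zero advantages
print(group_relative_advantages([1] * 8))            # too easy: all-zero advantages
print(group_relative_advantages([1] * 4 + [0] * 4))  # mid difficulty: useful signal
```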

What's the solution?

The researchers developed a framework called SEELE that dynamically adjusts the difficulty of each problem during training. It does this by appending a hint, a portion of a full solution, after the problem, so the model only has to complete the remaining reasoning. Crucially, SEELE does not give every problem the same amount of help: it estimates *how much* of a hint each specific problem needs. Over several rounds of rollouts, it fits an item response theory model to the hint lengths it has tried and the accuracies they produced, then uses that fit to predict the hint length that will keep the problem appropriately challenging, as illustrated in the sketch below.
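The summary gives no pseudocode, so the sketch below is only meant to make the idea concrete: fit a logistic, IRT-style success curve to the (hint length, accuracy) pairs observed in earlier rounds, then invert it to pick the hint length expected to land at a target accuracy. The parameterization, the grid-search fit, the 0.5 target, and all names are assumptions, not details from the paper.

```python
import numpy as np

def irt_curve(hint_ratio, a, b):
    """Logistic (2-parameter IRT-style) success curve: longer hints -> higher accuracy."""
    return 1.0 / (1.0 + np.exp(-a * (hint_ratio - b)))

def predict_next_hint(observed, target_acc=0.5):
    """Fit the curve to (hint_ratio, accuracy) pairs from earlier rounds by grid search,
    then invert it to get the hint ratio expected to hit the target accuracy."""
    ratios, accs = (np.array(x) for x in zip(*observed))
    a, b = min(
        ((a, b) for a in np.linspace(0.5, 20, 40) for b in np.linspace(0, 1, 41)),
        key=lambda p: np.sum((irt_curve(ratios, *p) - accs) ** 2),
    )
    # Invert p = sigmoid(a * (h - b)) at the target accuracy, clipped to a valid fraction.
    h = b + np.log(target_acc / (1.0 - target_acc)) / a
    return float(np.clip(h, 0.0, 1.0))

# Two earlier rounds: no hint gave 10% accuracy, half the solution gave 40%.
print(predict_next_hint([(0.0, 0.1), (0.5, 0.4)]))  # suggested hint ratio for the next round
```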

Why it matters?

This work is important because it makes LLMs learn more effectively. By optimizing the difficulty of the training problems, SEELE allows LLMs to improve their reasoning skills faster and achieve better results on challenging tasks like solving math problems. This could lead to more powerful and reliable AI systems capable of tackling complex real-world problems.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). However, existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability. LLMs fail to discover viable reasoning paths when problems are overly difficult, while learning little new capability when problems are too simple. In this work, we formalize the impact of problem difficulty by quantifying the relationship between loss descent speed and rollout accuracy. Building on this analysis, we propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region. SEELE augments each training sample by appending a hint (part of a full solution) after the original problem. Unlike previous hint-based approaches, SEELE deliberately and adaptively adjusts the hint length for each problem to achieve an optimal difficulty. To determine the optimal hint length, SEELE employs a multi-round rollout sampling strategy. In each round, it fits an item response theory model to the accuracy-hint pairs collected in preceding rounds to predict the required hint length for the next round. This instance-level, real-time difficulty adjustment aligns problem difficulty with the evolving model capability, thereby improving exploration efficiency. Experimental results show that SEELE outperforms Group Relative Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +11.8 and +10.5 points, respectively, and surpasses the best previous supervision-aided approach by +3.6 points on average across six math reasoning benchmarks.
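For concreteness, the two pieces described in the abstract could be wired together roughly as follows: each training sample is the original problem with a prefix of a reference solution appended, and a short multi-round loop reuses the predict_next_hint helper from the earlier sketch to adapt that prefix length. This is a hypothetical sketch, not the paper's implementation; in particular, whether hints are cut by characters, tokens, or solution steps, and how many rounds are used, are assumptions here.

```python
def augment_with_hint(problem: str, reference_solution: str, hint_ratio: float) -> str:
    """Append the first `hint_ratio` fraction of a reference solution after the problem,
    so the model only needs to complete the remaining reasoning (the character-level cut
    is an assumption; the paper may split by tokens or solution steps)."""
    cut = int(len(reference_solution) * hint_ratio)
    return f"{problem}\n{reference_solution[:cut]}"

def choose_hint_ratio(measure_accuracy, rounds=3, target_acc=0.5):
    """Multi-round search: try a hint length, measure rollout accuracy at that length,
    refit the curve, and repeat. `measure_accuracy(ratio)` stands in for running a group
    of rollouts on the hint-augmented problem and checking the verifiable reward."""
    observed, ratio = [], 0.0                 # round 1 starts from the unhinted problem
    for _ in range(rounds):
        observed.append((ratio, measure_accuracy(ratio)))
        ratio = predict_next_hint(observed, target_acc)   # IRT-style fit from the sketch above
    return ratio
```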