Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Jiale Zhao, Jingwen Yang, Jianwei Lv, Kongcheng Zhang, Yihe Zhou, Hengtong Lu, Wei Chen, Yan Xie, Mingli Song
2025-08-26
Summary
This paper introduces a new method called RuscaRL to improve the reasoning abilities of Large Language Models (LLMs) using Reinforcement Learning (RL). It tackles the problem of LLMs getting stuck repeating familiar solution patterns instead of exploring new, potentially better ways to solve problems.
What's the problem?
Large Language Models are getting better at reasoning thanks to Reinforcement Learning, but there's a catch: RL can only reinforce high-quality answers that the model manages to produce itself, and LLMs struggle to *find* those answers on their own. They tend to repeat what they already know, creating a loop in which they can't learn anything new because they can't explore effectively. Essentially, if an LLM can't come up with a solution during exploration, it can't be rewarded for it and so never learns to reach it, capping its potential.
What's the solution?
RuscaRL solves this by giving the LLM 'scaffolding' in the form of checklist-style rubrics. Think of a rubric as a study guide that breaks a complex task into smaller, checkable criteria. During rollout generation, different rubrics are inserted into the task instructions as external guidance, steering the model toward diverse, high-quality responses; as training progresses, this guidance is gradually decayed, so the model has to internalize the underlying reasoning patterns and rely on its own skills. The same rubrics then serve as references for an LLM-as-a-Judge, which scores each response to provide a reliable, verifiable reward signal for the RL update. Together, these two roles let the model learn effectively even on difficult, open-ended reasoning tasks (see the sketch below).
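To make the rubric's two roles concrete, here is a minimal, hypothetical Python sketch. The function names, the linear decay schedule, and the `judge.score` interface are illustrative assumptions, not the authors' implementation: the rubric is sometimes appended to the prompt during rollouts (with decreasing probability over training), and the same rubric is the reference an LLM judge scores against to produce the reward.

```python
# Hypothetical sketch of rubric-scaffolded rollouts and rubric-referenced rewards.
# All names and the decay schedule are illustrative assumptions, not the paper's code.
import random
from dataclasses import dataclass


@dataclass
class Rubric:
    """A checklist-style rubric: criteria a good answer should satisfy."""
    criteria: list[str]


def scaffold_probability(step: int, total_steps: int) -> float:
    """How often the rubric is shown in the prompt; decays linearly so the
    model gradually learns to reason without the scaffold (assumed schedule)."""
    return max(0.0, 1.0 - step / total_steps)


def build_prompt(question: str, rubric: Rubric, step: int, total_steps: int) -> str:
    """Rollout generation: optionally append the rubric as external guidance."""
    prompt = question
    if random.random() < scaffold_probability(step, total_steps):
        checklist = "\n".join(f"- {c}" for c in rubric.criteria)
        prompt += "\n\nWhen answering, make sure to:\n" + checklist
    return prompt


def judge_reward(response: str, rubric: Rubric, judge) -> float:
    """Reward for RL training: an LLM-as-a-Judge scores the response against
    each rubric criterion (assumed `judge.score` interface); the mean is the reward."""
    scores = [judge.score(response, criterion) for criterion in rubric.criteria]
    return sum(scores) / len(scores)
```

The paper's actual decay schedule and judging protocol may differ; the point of the sketch is only the split between scaffolded exploration (rubric in the prompt, fading over time) and rubric-referenced rewards (rubric as the judge's checklist).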
Why it matters?
This research matters because it significantly boosts the reasoning capabilities of LLMs: on the medical benchmark HealthBench-500, a RuscaRL-trained Qwen-2.5-7B-Instruct surpasses GPT-4.1, and a fine-tuned Qwen3-30B-A3B-Instruct variant outperforms leading models including OpenAI-o3. By breaking the exploration bottleneck, RuscaRL allows LLMs to tackle more complex problems and potentially become more reliable and helpful tools in areas like healthcare and beyond.
Abstract
Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen-2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3.