Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng, Yuxuan Gu, Xing Yu, Bing Qin
2026-03-12
Summary
This paper introduces a new way to train large language models using feedback, going beyond just simple 'good' or 'bad' signals.
What's the problem?
Currently, when we try to improve large language models using reinforcement learning, we usually give them a single number as a reward. However, people often give much more detailed feedback in natural language – like pointing out specific errors or suggesting improvements. Existing methods don't really take advantage of this rich information, making the learning process slow and inefficient, especially when the model isn't getting clear signals about what it's doing right or wrong.
What's the solution?
The researchers developed a framework called GOLF that uses two kinds of group-level feedback. First, it looks at external critiques, which are like suggestions for fixing mistakes. Second, it examines different attempts within a group to see what approaches didn't work and why. GOLF combines these to create helpful 'refinements' that guide the model's learning. It's like giving the model targeted hints to help it explore and improve, and it does this while continuously improving both its ability to generate text and its ability to understand and use feedback.
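To make the idea above concrete, here is a minimal sketch of the feedback-to-refinement flow. The paper's actual implementation is not reproduced here; all names (`Attempt`, `aggregate_refinement`, `should_inject`), the sparse-reward threshold, and the way critiques are merged are illustrative assumptions.

```python
# Hypothetical sketch of GOLF's group-level feedback loop.
# Assumed pieces (not from the paper): Attempt dataclass, a reward
# threshold of 0.1 for "sparse-reward" detection, and string-level
# critique aggregation standing in for the refinement model.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    text: str
    reward: float       # scalar reward from the verifier/environment
    critique: str = ""  # external NL critique, if any was received


def aggregate_refinement(attempts: List[Attempt]) -> Optional[str]:
    """Combine external critiques and intra-group failure patterns
    into a single refinement hint (stand-in for the real refiner)."""
    critiques = [a.critique for a in attempts if a.critique]
    failures = [a.text for a in attempts if a.reward == 0.0]
    if not critiques and not failures:
        return None
    return ("Fix the following issues: " + "; ".join(critiques)
            + f" | avoid {len(failures)} failed approaches")


def should_inject(attempts: List[Attempt], threshold: float = 0.1) -> bool:
    """Adaptively inject refinements only in sparse-reward regions,
    i.e. when the group's mean reward falls below a threshold."""
    mean_reward = sum(a.reward for a in attempts) / len(attempts)
    return mean_reward < threshold


def build_training_batch(prompt: str, attempts: List[Attempt]) -> List[str]:
    """On-policy samples, plus one off-policy refinement scaffold
    when the whole group is stuck (most attempts failing)."""
    batch = [prompt + a.text for a in attempts]
    if should_inject(attempts):
        refinement = aggregate_refinement(attempts)
        if refinement:
            batch.append(prompt + refinement)  # off-policy scaffold
    return batch
```

The key design point this sketch tries to capture: the refinement is only added when the scalar reward alone gives the model nothing to learn from, so the extra guidance targets exactly the sparse-reward regions.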
Why it matters?
This work is important because it makes training large language models much more efficient. By using detailed language feedback instead of just simple rewards, GOLF helps models learn faster and perform better, achieving a 2.2× improvement in sample efficiency over methods trained on scalar rewards alone. This could lead to more capable and helpful AI systems.
Abstract
Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level signals are combined to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, with a 2.2× improvement in sample efficiency over RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
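One plausible reading of "jointly optimizes generation and refinement within a unified RL loop" is that both rollout types receive group-normalized advantages in the same update, GRPO-style. The sketch below illustrates that reading only; the paper's actual objective may differ, and `group_advantages` and `unified_update` are hypothetical names.

```python
# Toy illustration of a unified update over two groups of rollouts:
# generation attempts and refinement attempts. Group-relative
# normalization (as in GRPO-style methods) is an assumption here,
# not a claim about GOLF's exact objective.

import statistics
from typing import Dict, List


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one group: subtract the group mean
    and divide by the group standard deviation (guarding zero std)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]


def unified_update(gen_rewards: List[float],
                   ref_rewards: List[float]) -> Dict[str, List[float]]:
    """One RL step covering both roles: generation rollouts and
    refinement rollouts each form their own group, so a single
    update improves both capabilities."""
    return {
        "generation": group_advantages(gen_rewards),
        "refinement": group_advantages(ref_rewards),
    }
```

Because each group is normalized against itself, a successful refinement earns positive advantage even when all plain generations fail, which is the "virtuous cycle" the abstract describes: better refinements create better training signal for generation, and vice versa.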