On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral
Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li
2025-12-05
Summary
This paper investigates a failure mode in a popular way to train large language models (LLMs) to use tools like search engines. The training method, called Group Relative Policy Optimization (GRPO), is fast and efficient because it needs no separate value model, but it often collapses mid-training, leading to poor results.
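To make "group relative" concrete, here is a toy sketch of how GRPO scores a group of sampled answers: each answer's advantage is its reward relative to the group's mean, normalized by the group's spread, so no learned value function is needed. The function name and the 0/1 rewards are illustrative, not from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: reward minus the group mean, divided by
    the group standard deviation. This is the value-free normalization
    that makes GRPO appealing for tool-integrated RL."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    if std == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]

# A group of 4 sampled answers, rewarded 1 if correct and 0 otherwise:
# correct answers get advantage +1.0, incorrect ones -1.0.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```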
What's the problem?
When LLMs are trained to use tools, GRPO sometimes experiences a 'training collapse' where the model stops learning and its performance degrades. The researchers trace this to something they call 'Lazy Likelihood Displacement' (LLD): early in training, the model becomes less confident in *all* of its answers, both right and wrong. That low confidence inflates the training gradients, which pushes confidence down further, creating a self-reinforcing spiral in which the model gets steadily worse.
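A toy numeric illustration (ours, not the paper's analysis) of why shrinking likelihood inflates gradients: policy-gradient updates scale with the gradient of log-probability, and since d/dp log p = 1/p, that gradient magnitude blows up as a trajectory's likelihood p approaches zero.

```python
def logprob_grad_magnitude(p):
    """|d/dp log p| = 1/p: grows without bound as likelihood p -> 0,
    which is the mechanism behind the inflating gradients in the spiral."""
    return 1.0 / p

# As the model's confidence in a response drops 500x,
# the log-prob gradient magnitude grows 500x with it.
for p in [0.5, 0.1, 0.01, 0.001]:
    print(f"likelihood={p:<6} -> log-prob gradient magnitude={logprob_grad_magnitude(p):g}")
```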
What's the solution?
To fix this, the researchers created a likelihood-preserving regularization for GRPO, called LLDS. It works like a safety net that activates only when a response's likelihood starts to drop. LLDS then adjusts only the specific tokens responsible for the decrease, without interfering with the rest of the optimization. This fine-grained intervention prevents the negative spiral caused by LLD.
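The gating described above can be sketched as a penalty term. This is a hypothetical illustration in the spirit of LLDS, not the paper's implementation: the penalty is zero unless the trajectory's total log-likelihood dropped relative to the previous policy, and it then targets only the tokens whose own log-probability decreased. The function name and coefficient are assumptions.

```python
import math

def lld_penalty(old_token_logps, new_token_logps, coef=0.1):
    """Hypothetical likelihood-preserving penalty (illustrative sketch):
    - gate: zero unless the whole trajectory's log-likelihood decreased;
    - target: once active, penalize only tokens that lost probability mass."""
    if sum(new_token_logps) >= sum(old_token_logps):
        return 0.0  # likelihood preserved: do not interfere with optimization
    drops = [old - new
             for old, new in zip(old_token_logps, new_token_logps)
             if new < old]  # only the tokens responsible for the drop
    return coef * sum(drops)

# Trajectory likelihood fell: tokens 1 and 3 dropped, token 2 improved,
# so only the two drops contribute to the penalty.
old = [math.log(0.9), math.log(0.8), math.log(0.7)]
new = [math.log(0.5), math.log(0.85), math.log(0.4)]
print(lld_penalty(old, new))
```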
Why does it matter?
This research is important because it identifies a key obstacle to effectively training LLMs to use tools. By understanding and addressing LLD, the researchers have made a significant step towards creating more reliable and powerful AI systems that can leverage external tools to solve complex problems. Their method improves performance substantially on several question-answering tasks, showing it's a practical solution for building better tool-integrated LLMs.
Abstract
Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a lightweight likelihood-preserving regularization, LLDS, for GRPO that activates only when a trajectory's likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TI RL and provide a practical path toward stable, scalable training of tool-integrated LLMs.