
RAGEN-2: Reasoning Collapse in Agentic RL

Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li

2026-04-09


Summary

This paper investigates an instability in training large language models (LLMs) to act as agents that interact with an environment over multiple turns. It finds that the standard metric for checking whether an LLM is 'thinking' well (entropy) doesn't tell the whole story, and it proposes a better way to measure reasoning quality along with a fix for training.

What's the problem?

When training LLMs to be agents, it's hard to tell if they're actually responding to the specific input they're given, or if they're just using a pre-programmed, generic response. A common way to check for good reasoning is to measure 'entropy,' which looks at how diverse the LLM's responses are for a single input. However, the researchers discovered that LLMs can appear to have high entropy – meaning diverse responses – but still be ignoring the actual input, relying on fixed templates. This is called 'template collapse,' and existing methods can't detect it. Essentially, the LLM *looks* like it's thinking, but isn't actually processing the information.
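The gap between the two quantities can be made concrete with a toy simulation. The sketch below (illustrative only; the metric names and the template strings are invented for the example, not taken from the paper) shows a "model" that cycles randomly through a few canned reasoning templates regardless of the input: the entropy of its responses is high, but the mutual information between input and response is near zero, which is exactly the template-collapse signature.

```python
import math
import random
from collections import Counter

def entropy(samples):
    """Shannon entropy (in nats) of the empirical distribution of samples."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) from a list of (input, response) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return entropy(xs) + entropy(ys) - entropy(pairs)  # H(X)+H(Y)-H(X,Y)

# Template collapse: the response is drawn from a fixed template pool,
# independently of which task the model is given.
random.seed(0)
templates = ["Let me plan step by step.",
             "First, I will explore the state.",
             "I should check the goal condition."]
inputs = ["task_A", "task_B", "task_C"] * 20
pairs = [(x, random.choice(templates)) for x in inputs]

print(f"H(response)        = {entropy([y for _, y in pairs]):.3f}")  # high
print(f"I(input; response) = {mutual_information(pairs):.3f}")       # near 0
```

Responses drawn this way look diverse (entropy close to log 3), yet knowing the response tells you nothing about the input: only the MI-style diagnostic, not entropy, exposes that the reasoning is input-agnostic.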

What's the solution?

The researchers introduced a metric based on 'Mutual Information' (MI) to measure how much the LLM's reasoning changes depending on the input. Unlike entropy, MI checks whether the reasoning is actually *distinguishable* across different prompts, and they found it is a much stronger predictor of final performance. They also explained why template collapse happens: when reward variance is low, the task gradient is weak relative to regularization terms, so regularization dominates training and erases the differences in reasoning across inputs. To fix this, they developed 'SNR-Aware Filtering,' which concentrates training on prompts that carry a clear learning signal, using the variance of rewards across rollouts as a cheap proxy for that signal.
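The filtering idea can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's implementation: the function name, the top-k selection rule, and the toy reward values are all invented here. The core idea it demonstrates is the one described above: prompts whose rollouts all get the same reward (always solved or always failed) contribute essentially no policy-gradient signal, so training is focused on the high-variance prompts instead.

```python
import statistics

def snr_filter(prompt_rewards, top_k):
    """Keep the prompts whose rollout rewards vary the most.

    prompt_rewards maps each prompt to the rewards of a group of sampled
    rollouts. Reward variance acts as a lightweight proxy for gradient
    signal: zero-variance prompts yield a ~zero advantage and thus a
    ~zero task gradient.
    """
    ranked = sorted(prompt_rewards,
                    key=lambda p: statistics.pvariance(prompt_rewards[p]),
                    reverse=True)
    return ranked[:top_k]

# Toy batch: rollout rewards for four prompts (values are illustrative).
batch = {
    "prompt_solved":   [1.0, 1.0, 1.0, 1.0],  # always succeeds: no signal
    "prompt_failed":   [0.0, 0.0, 0.0, 0.0],  # always fails: no signal
    "prompt_frontier": [0.0, 1.0, 1.0, 0.0],  # high variance: informative
    "prompt_partial":  [0.2, 0.4, 0.3, 0.3],  # some variance
}
print(snr_filter(batch, top_k=2))  # ['prompt_frontier', 'prompt_partial']
```

Filtering per iteration this way keeps the task gradient strong enough that it is not drowned out by regularization, which is the failure mechanism the paper identifies behind template collapse.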

Why it matters?

This work is important because it identifies a hidden flaw in how we train LLM agents. If we don't know if an LLM is truly reasoning, we can't trust it to perform complex tasks reliably. By introducing Mutual Information and SNR-Aware Filtering, the researchers provide better tools for diagnosing and preventing template collapse, leading to more robust and capable AI agents that actually understand and respond to the world around them.

Abstract

RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.