Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
2025-10-29
Summary
This paper focuses on improving how AI agents learn to search for information and answer complex questions, specifically those that require using multiple tools or sources. It's about making these agents better at learning even when they get close to the right answer but aren't quite perfect.
What's the problem?
Current methods for training these AI agents throw away important information about *what* the agent correctly identified during its search process. They only focus on whether the final answer was right or wrong. This means the agent can't learn from attempts where it understood most of the problem but made a small mistake at the end, essentially missing out on valuable learning opportunities. It treats a nearly correct attempt the same as a complete failure.
What's the solution?
The researchers observed that the more key pieces of information (entities) an agent correctly identifies during its reasoning, the more likely it is to get the final answer right. Building on this, they developed a new training method called Entity-aware Group Relative Policy Optimization (E-GRPO). This method gives the agent partial credit for correctly identifying those key entities even when the final answer is wrong, so the agent can learn from 'near-miss' attempts and improve more effectively.
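To make the idea concrete, here is a minimal sketch of what an entity-aware reward could look like. The function names, the substring-based entity matching, and the `alpha` scaling factor are illustrative assumptions, not the paper's exact formulation; the key property it demonstrates is that an incorrect rollout earns partial reward proportional to how many ground-truth entities its reasoning surfaced.

```python
def entity_match_rate(reasoning_trace: str, gold_entities: list[str]) -> float:
    """Fraction of ground-truth entities mentioned in the agent's reasoning.

    Simple case-insensitive substring matching; a real implementation
    would likely use normalization or fuzzy matching.
    """
    if not gold_entities:
        return 0.0
    text = reasoning_trace.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return hits / len(gold_entities)


def entity_aware_reward(is_correct: bool,
                        reasoning_trace: str,
                        gold_entities: list[str],
                        alpha: float = 0.5) -> float:
    """Dense reward: full credit for a correct final answer, otherwise
    partial credit scaled by the entity match rate.

    `alpha` is a hypothetical scaling factor keeping partial rewards
    below the reward for a fully correct answer.
    """
    if is_correct:
        return 1.0
    return alpha * entity_match_rate(reasoning_trace, gold_entities)


# A "near-miss" rollout: the agent surfaced 2 of 3 key entities but gave
# a wrong final answer, so it still receives a nonzero learning signal.
reward = entity_aware_reward(
    is_correct=False,
    reasoning_trace="Marie Curie received the award in 1903 ...",
    gold_entities=["Marie Curie", "1903", "Nobel Prize"],
)
```

Under a purely outcome-based reward, this rollout would score 0 and be indistinguishable from a complete failure; here it earns a reward proportional to its entity match rate.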
Why it matters?
This research is important because it makes AI agents more efficient and accurate. By learning from partial successes, the agents not only get better at answering questions but also learn to do so using fewer steps and searches. This means they can solve problems faster and with less computing power, making them more practical for real-world applications like research and question answering.
Abstract
LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.