
Search Self-play: Pushing the Frontier of Agent Capability without Supervision

Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Haotian Xu, Jiaqi Guo, Chutian Wang, Haonan Chen, Xiaoxi Jiang, Guanjun Jiang

2025-10-24


Summary

This paper introduces a new way to train AI agents, specifically large language models, to get better at using search engines to solve complex, multi-step problems. It builds on a technique called Reinforcement Learning with Verifiable Rewards (RLVR), which relies on giving the model clear, automatically checkable feedback on its answers.

What's the problem?

Currently, training these agents requires substantial human effort to write good tasks and verify ground-truth answers so that the feedback is reliable. This is a bottleneck, especially for more complex, 'agentic' tasks where the model needs to plan and execute multiple steps. Existing methods for automatically generating tasks struggle to control task difficulty, so the generated tasks often provide little useful training signal.

What's the solution?

The researchers developed a 'self-play' system in which the same model essentially plays against itself. Acting as the 'proposer,' it creates search queries with known answers, gradually increasing their difficulty; acting as the 'solver,' it tries to answer those queries using a search engine. To make sure each answer is verifiable, the system collects all the documents the proposer retrieved while constructing the query and checks, via retrieval-augmented generation, that the query can be answered correctly from that evidence alone. This cycle of proposing and solving tasks improves both roles over time through a mix of competition and cooperation.
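
To make the loop concrete, here is a minimal Python sketch of one self-play round. It assumes the round can be decomposed into four callables (propose, rag_answer, solve, match) that wrap the underlying LLM and search-engine calls; these names and the exact reward scheme are illustrative assumptions, not taken from the paper's code.

```python
from typing import Callable, List, Optional, Tuple

def ssp_round(
    propose: Callable[[], Tuple[str, str, List[str]]],  # -> (query, ground_truth, retrieved_docs)
    rag_answer: Callable[[str, List[str]], str],         # answer the query from the given docs only
    solve: Callable[[str], str],                         # answer the query with live search access
    match: Callable[[str, str], bool],                   # answer-equivalence judge
) -> Optional[Tuple[str, str, float, float]]:
    """One illustrative round of the search self-play (SSP) game."""
    # Proposer turn: the model searches, then proposes a query together with
    # its claimed ground-truth answer and the documents it retrieved on the way.
    query, ground_truth, docs = propose()

    # Verification: keep the task only if RAG over the proposer's own retrieved
    # documents reproduces the claimed ground truth.
    if not match(rag_answer(query, docs), ground_truth):
        return None  # unverifiable task, no training signal

    # Solver turn: attempt the verified query with fresh search-engine access.
    prediction = solve(query)

    # Opposed rewards: the solver is rewarded for answering correctly, the
    # proposer for verified tasks the solver fails, which pushes difficulty up.
    solver_reward = 1.0 if match(prediction, ground_truth) else 0.0
    return query, ground_truth, solver_reward, 1.0 - solver_reward
```

Because the proposer only benefits from verified tasks that the solver fails, it is pushed toward progressively harder but still answerable queries, which is the competitive pressure the paper relies on.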

Why it matters?

This research matters because it offers a way to train AI agents to use search engines more effectively with far less human-labeled supervision. That makes training more scalable and lets the agents tackle more complex problems, potentially leading to more capable and helpful AI assistants.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires massive human effort and hinders the RL scaling process, especially in agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of the generated agentic tasks can hardly be controlled to provide an effective RL training advantage. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output correct answer predictions. To ensure that each generated search query has an accurate ground truth, we collect all the search results from the proposer's trajectory as external knowledge, then conduct retrieval-augmented generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. Extensive experimental results show that SSP can significantly and uniformly improve search agents' performance on various benchmarks without any supervision, under both from-scratch and continuous RL training setups. The code is at https://github.com/Alibaba-Quark/SSP.
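
As a rough illustration of the verification step described in the abstract, the sketch below answers the proposed query using only the documents gathered along the proposer's trajectory and keeps the task when that answer matches the claimed ground truth. The call_llm wrapper, judge function, and prompt wording are hypothetical placeholders, not the repository's actual API.

```python
from typing import Callable, List

def verify_with_rag(
    call_llm: Callable[[str], str],      # hypothetical chat-completion wrapper
    judge: Callable[[str, str], bool],   # answer-equivalence check
    query: str,
    ground_truth: str,
    docs: List[str],
) -> bool:
    """Keep a proposed task only if it is answerable from the proposer's own evidence."""
    # Concatenate the proposer's retrieved documents into a single RAG context.
    context = "\n\n".join(f"[Doc {i + 1}] {doc}" for i, doc in enumerate(docs))
    prompt = (
        "Answer the question using only the documents below.\n\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )
    prediction = call_llm(prompt)
    # The task is verifiable if the context-only answer matches the claimed ground truth.
    return judge(prediction, ground_truth)
```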