
Dr. Zero: Self-Evolving Search Agents without Training Data

Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, Dong Wang

2026-01-13

Summary

This paper introduces a new method, called Dr. Zero, for improving large language models (LLMs) without needing any labeled training data. It focuses on making these models better at complex problem-solving, specifically when they need to search for information and use tools to find answers.

What's the problem?

Large language models are getting really good, but they usually need tons of labeled examples to learn from, and that data is hard to obtain. Letting these models improve *themselves* without any data (called data-free self-evolution) is especially tricky for models that need to ask questions and use tools over multiple steps. These 'search agents' get stuck because the questions they generate for themselves aren't diverse enough, and checking whether each question is good or bad takes a lot of computing power.

What's the solution?

Dr. Zero tackles this with two agents initialized from the same base model that work together: a 'proposer' and a 'solver'. The proposer generates questions, and the solver tries to answer them. As the solver improves, it pushes the proposer to create harder but still solvable questions, forming an automated curriculum in which both agents keep improving. To make training cheaper, the authors also develop hop-grouped relative policy optimization (HRPO), which clusters structurally similar questions and uses group-level baselines, so each question's difficulty and solvability can be estimated with far less sampling than scoring every question individually.
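To make the group-level baseline idea concrete, here is a minimal sketch in the spirit of HRPO. It is not the paper's implementation: the grouping key (hop count) and the normalization are assumptions modeled on standard group-relative policy optimization, where each rollout's advantage is its reward relative to its group's mean.

```python
# Hypothetical sketch of a hop-grouped relative baseline (not the
# authors' code): rollouts are grouped by hop count, and each reward
# is normalized against its own group's statistics.
from collections import defaultdict
from statistics import mean, pstdev

def hop_grouped_advantages(rollouts):
    """rollouts: list of (hop_count, reward) pairs.
    Returns one advantage per rollout, normalized within its hop group."""
    groups = defaultdict(list)
    for hops, reward in rollouts:
        groups[hops].append(reward)

    advantages = []
    for hops, reward in rollouts:
        rewards = groups[hops]
        baseline = mean(rewards)            # group-level baseline
        scale = pstdev(rewards) or 1.0      # avoid div-by-zero for uniform groups
        advantages.append((reward - baseline) / scale)
    return advantages

# Example: two 2-hop rollouts with mixed outcomes, two 3-hop successes.
advs = hop_grouped_advantages([(2, 1.0), (2, 0.0), (3, 1.0), (3, 1.0)])
```

In this toy run, the successful 2-hop rollout gets a positive advantage and the failed one a negative advantage, while the uniformly solved 3-hop group contributes zero advantage, which is how a shared baseline saves the per-question sampling that estimating difficulty individually would require.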

Why it matters?

This research is important because it shows that complex reasoning and problem-solving skills can be developed in large language models *without* relying on massive datasets created by humans. This means we can potentially build more powerful and adaptable AI systems even when data is scarce, and it opens up possibilities for AI to learn and improve on its own.

Abstract

As high-quality data becomes increasingly difficult to obtain, data-free self-evolution has emerged as a promising paradigm. This approach allows large language models (LLMs) to autonomously generate and solve complex problems, thereby improving their reasoning capabilities. However, multi-turn search agents struggle in data-free self-evolution due to the limited question diversity and the substantial compute required for multi-step reasoning and tool use. In this work, we introduce Dr. Zero, a framework enabling search agents to effectively self-evolve without any training data. In particular, we design a self-evolution feedback loop where a proposer generates diverse questions to train a solver initialized from the same base model. As the solver evolves, it incentivizes the proposer to produce increasingly difficult yet solvable tasks, thus establishing an automated curriculum to refine both agents. To enhance training efficiency, we also introduce hop-grouped relative policy optimization (HRPO). This method clusters structurally similar questions to construct group-level baselines, effectively minimizing the sampling overhead in evaluating each query's individual difficulty and solvability. Consequently, HRPO significantly reduces the compute requirements for solver training without compromising performance or stability. Extensive experimental results demonstrate that the data-free Dr. Zero matches or surpasses fully supervised search agents, proving that complex reasoning and search capabilities can emerge solely through self-evolution.