OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen

2026-03-17

Summary

This paper introduces OpenSeeker, a new, completely open-source system designed to help Large Language Models (LLMs) perform complex searches and answer questions that require looking at information from multiple sources online.

What's the problem?

Building powerful search tools for LLMs is hard and expensive, and currently only big companies have the resources to do it well. The training data needed to teach these systems to search effectively is difficult to create and rarely shared publicly, which slows down progress for researchers outside those companies.

What's the solution?

The researchers tackled this problem by generating their own training data. They built a pipeline that automatically creates challenging search questions and verifies the correct answers by exploring the web. It uses two main techniques: first, it constructs complex questions whose answers require combining information from multiple websites; second, it cleans up the recorded search trajectories by summarizing earlier steps, so the teacher model focuses on the most relevant information when choosing its next action. This allowed them to train OpenSeeker with a relatively small dataset of about 11,700 examples.
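The question-generation idea can be sketched in miniature. The snippet below is a hypothetical illustration, not the paper's implementation: it walks a toy knowledge graph outward from a seed entity ("topological expansion") and then hides the seed behind an indirect description ("entity obfuscation"), so answering requires resolving multiple facts. All entity names, relations, and function names are invented for this example.

```python
# Hypothetical sketch of fact-grounded multi-hop QA synthesis.
# Not the paper's code: graph, relations, and helpers are illustrative.
import random

# Toy knowledge graph: entity -> list of (relation, neighbor) facts
GRAPH = {
    "Marie Curie": [("born_in", "Warsaw"), ("field", "radioactivity")],
    "Warsaw": [("capital_of", "Poland")],
    "Poland": [("continent", "Europe")],
}

def topological_expansion(seed, hops):
    """Walk up to `hops` edges from the seed, collecting a chain of facts."""
    chain, entity = [], seed
    for _ in range(hops):
        facts = GRAPH.get(entity)
        if not facts:
            break
        relation, neighbor = random.choice(facts)
        chain.append((entity, relation, neighbor))
        entity = neighbor
    return chain

def obfuscate(entity):
    """Replace the named entity with an indirect description built from one
    of its own facts, forcing the solver to identify it via search."""
    relation, neighbor = GRAPH[entity][0]
    return f"the entity whose '{relation}' is {neighbor}"

def synthesize_qa(seed, hops=2):
    """Compose a multi-hop question; the answer is the chain's final node."""
    chain = topological_expansion(seed, hops)
    head = obfuscate(seed)
    relations = " -> ".join(rel for _, rel, _ in chain)
    question = f"Starting from {head}, follow {relations}: what do you reach?"
    answer = chain[-1][2]
    return question, answer

question, answer = synthesize_qa("Marie Curie")
```

Because the answer is derived from graph facts rather than guessed, every synthesized question comes with a verifiable ground truth, which is what makes the data usable for supervised training.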

Why it matters?

OpenSeeker is important because it levels the playing field for researchers. By releasing both the system itself and the data used to train it, the authors let anyone build upon this work and create even better search tools for LLMs. It also shows that massive resources and continual training aren't strictly necessary for top-notch performance: OpenSeeker even outperforms some commercial systems.

Abstract

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) fact-grounded, scalable, controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity; and (2) denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby enabling the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained in a single run on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent, DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.
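The retrospective summarization mechanism from the abstract can be illustrated with a minimal sketch. This is an assumption about the general shape of the idea, not the paper's implementation: before the teacher model chooses its next action, each earlier raw observation in the trajectory (e.g., a long scraped web page) is replaced by a short summary so that noise does not accumulate in the context. Here `summarize` is a trivial stand-in for what would be an LLM call.

```python
# Hypothetical sketch of retrospective trajectory denoising.
# `summarize` is a placeholder for an LLM summarizer; the trajectory
# schema ({"action", "observation"}) is invented for illustration.

def summarize(observation: str, max_words: int = 12) -> str:
    """Stand-in summarizer: truncate to the first few words."""
    words = observation.split()
    if len(words) <= max_words:
        return observation
    return " ".join(words[:max_words]) + " ..."

def denoise_trajectory(trajectory: list[dict]) -> list[dict]:
    """Return a copy of the trajectory where every past observation is
    replaced by its summary, keeping actions intact."""
    return [
        {"action": step["action"], "observation": summarize(step["observation"])}
        for step in trajectory
    ]

# Example: one noisy step whose observation is a long page dump.
raw = [{"action": "search('OpenSeeker')", "observation": "page text " * 40}]
clean = denoise_trajectory(raw)
```

The design intuition is that a teacher model conditioned on compact, relevant context emits cleaner next actions, which in turn yields higher-quality supervised training trajectories.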