WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization
Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
2025-07-22
Summary
This paper introduces WebShaper, a framework that helps AI models learn to find and understand information on the web by synthesizing better-organized training data.
What's the problem?
Current methods collect web data first and then generate questions from it, which can create mismatches between the structure of the collected information and the reasoning needed to answer the questions well.
What's the solution?
WebShaper reverses that order. It starts from a formal specification of the information-seeking task, built on set theory and a concept called Knowledge Projections, and uses that specification to guide the creation of questions and answers. The process is step-by-step: it begins with simple tasks and then automatically expands them into harder ones, so the AI gets high-quality, well-structured practice.
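To make the set-theoretic idea concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's implementation: the toy triples, the `kp` helper, and the reading of a Knowledge Projection as "the set of entities linked by a relation to any member of a target set" are all assumptions made for this example. A question is modeled as an intersection of such sets, and a task is made harder by replacing a known constant with another projection that resolves to it.

```python
# Toy knowledge base of (subject, relation, object) triples. All data is made up.
TRIPLES = {
    ("alice", "born_in", "1990"),
    ("bob", "born_in", "1990"),
    ("carol", "born_in", "1985"),
    ("alice", "plays_for", "team_x"),
    ("carol", "plays_for", "team_x"),
    ("bob", "plays_for", "team_y"),
    ("team_x", "based_in", "city_z"),
}


def kp(relation, targets):
    """Hypothetical Knowledge Projection: subjects linked by `relation` to any target."""
    return {s for (s, r, o) in TRIPLES if r == relation and o in targets}


# Seed task: "Who was born in 1990 and plays for team_x?"
# Modeled as the intersection of two projections over constant targets.
seed_answer = kp("born_in", {"1990"}) & kp("plays_for", {"team_x"})

# Expanded task: the constant "team_x" is replaced by a projection that resolves
# to it ("a team based in city_z"), which adds one reasoning hop to the question.
expanded_answer = kp("born_in", {"1990"}) & kp(
    "plays_for", kp("based_in", {"city_z"})
)

print(seed_answer, expanded_answer)
```

In this sketch the expanded question needs one more lookup than the seed question, yet both resolve to the same answer set, which is the property a simple-to-hard expansion step would need to preserve.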
Why does it matter?
This matters because better training data helps AI agents become smarter and more reliable when searching for information, leading to advances in AI systems that can help people with complex web-based tasks.
Abstract
WebShaper, a formalization-driven framework, synthesizes information-seeking datasets using set theory and Knowledge Projections to enhance reasoning structure and achieve top performance among open-source information-seeking agents on benchmarks.