
Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms

Shrey Pandit, Xuan-Phi Nguyen, Yifei Ming, Austin Xu, Jiayu Wang, Caiming Xiong, Shafiq Joty

2025-10-17

Summary

This paper focuses on building better 'deep research' agents: programs that answer complicated questions by searching the internet and using various online tools. The core idea is to improve how these agents reason through problems across many steps and tool calls, rather than just producing quick answers.

What's the problem?

Current language models, the brains behind these agents, aren't well suited to the complex, multi-step reasoning that in-depth research requires. Existing methods for creating training data for these agents often don't allow precise control over how hard or high-quality the questions are. They also tend to conflate whether improvements come from the data itself or from better training techniques, making it hard to know what's actually working.

What's the solution?

The researchers developed a new way to create training data. They start with relatively simple questions and progressively make them harder until a strong baseline web agent can no longer answer them correctly. That baseline agent plays several roles throughout the process: attempting each question, verifying facts, checking for alternative answers, and filtering out bad questions. The resulting data is then used to train new agents under a fixed, consistent training recipe, so that any gains can be attributed to the data itself.
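The loop above can be sketched in a few lines: keep hardening a seed question until the baseline agent fails, and discard seeds that the agent always solves. This is a minimal toy illustration, not the paper's actual pipeline; all names here (`ToyAgent`, `make_harder`, `synthesize`) and the hop-counting heuristic are invented for the sketch.

```python
# Toy sketch of progressive-difficulty data synthesis.
# The real pipeline uses a frontier web agent and live tools; here a
# hypothetical ToyAgent "fails" once a question exceeds its hop budget.

def make_harder(question: str, step: int) -> str:
    """Placeholder transformation: compose the question with one extra hop."""
    return f"{question} [+hop {step}]"

class ToyAgent:
    """Stand-in for the baseline agent: solves questions up to max_hops hops."""
    def __init__(self, max_hops: int):
        self.max_hops = max_hops

    def solves(self, question: str) -> bool:
        return question.count("[+hop") <= self.max_hops

def synthesize(seed_question: str, agent: ToyAgent, max_steps: int = 10):
    """Harden the seed until the agent fails; return the hardened question,
    or None if the agent never fails (the seed is filtered out)."""
    question = seed_question
    for step in range(1, max_steps + 1):
        question = make_harder(question, step)
        if not agent.solves(question):
            return question  # hard enough for the baseline: keep it
    return None  # agent solved every variant; discard the seed
```

For example, `synthesize("Who founded X?", ToyAgent(max_hops=2))` keeps adding hops until the third one, where the toy agent fails, and returns that three-hop question. In the paper's version, the agent's roles (attempting, fact-checking, checking alternative answers, filtering) would each gate whether a hardened question is kept.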

Why it matters?

This work is important because the new dataset, despite being smaller than existing ones, produces more effective web agents. Agents trained on it use a wider variety of online tools, avoid getting stuck in repetitive tool-calling patterns, and ultimately perform better on complex research tasks. This moves us closer to AI that can genuinely assist with in-depth information gathering and problem-solving.

Abstract

Web-based 'deep research' agents aim to solve complex question-answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset, despite being smaller, enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.