AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu

2024-12-13

Summary

This paper presents AgentTrek, a system that creates training data for AI agents that interact with graphical user interfaces (GUIs) by using information from web tutorials.

What's the problem?

Developing AI agents that can automate tasks in software applications is challenging because they need high-quality data showing how to perform complex, multi-step actions. Traditionally, this data is created by humans, which is expensive and time-consuming, making it hard to scale up the training process.

What's the solution?

AgentTrek addresses this problem by automatically gathering tutorial-like texts from the internet and transforming them into task goals with step-by-step instructions. A visual-language model (VLM) agent then follows these instructions to carry out each task in a real digital environment, and a VLM-based evaluator checks whether the task was completed correctly, which keeps the resulting training data high quality. This method generates large amounts of trajectory data without extensive human involvement.
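The flow described above (filter tutorial-like text, convert it into a task, replay it with an agent, keep only trajectories the evaluator accepts) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`Task`, `Trajectory`, `synthesize`, and the `toy_*` stand-ins) is hypothetical, and the toy functions are simple placeholders for the model-based components the summary mentions.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class Task:
    """A task goal plus step-by-step instructions (hypothetical structure)."""
    goal: str
    steps: List[str]


@dataclass
class Trajectory:
    """The actions an agent took while attempting a task (hypothetical structure)."""
    task: Task
    actions: List[str] = field(default_factory=list)


def synthesize(
    texts: Iterable[str],
    looks_like_tutorial: Callable[[str], bool],
    to_task: Callable[[str], Optional[Task]],
    replay: Callable[[Task], Trajectory],
    evaluate: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Run the four stages in order and keep only accepted trajectories."""
    kept: List[Trajectory] = []
    for text in texts:
        if not looks_like_tutorial(text):
            continue  # stage 1: drop text that is not tutorial-like
        task = to_task(text)  # stage 2: text -> goal and steps
        if task is None:
            continue
        trajectory = replay(task)  # stage 3: agent attempts the task
        if evaluate(trajectory):  # stage 4: evaluator accepts or rejects
            kept.append(trajectory)
    return kept


# Toy stand-ins (assumptions for illustration only; a real system would
# call a classifier, a language model, a VLM agent, and a VLM evaluator).
def toy_filter(text: str) -> bool:
    return "step" in text.lower()


def toy_to_task(text: str) -> Optional[Task]:
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        return None  # need a goal line and at least one step
    return Task(goal=lines[0], steps=lines[1:])


def toy_replay(task: Task) -> Trajectory:
    return Trajectory(task=task, actions=[f"do: {step}" for step in task.steps])


def toy_evaluate(trajectory: Trajectory) -> bool:
    # Accept only if every instruction produced an action.
    n = len(trajectory.actions)
    return n > 0 and n == len(trajectory.task.steps)
```

Passing the stages in as callables keeps the sketch independent of any particular model or browser environment; each toy function could be swapped for a real component without changing `synthesize`.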

Why does it matter?

This research matters because it offers a more scalable way to train GUI agents. According to the authors, training on the synthesized trajectories significantly improves agents' grounding and planning performance over current models, and the approach is more cost-efficient than traditional human annotation.

Abstract

Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive human annotation, making them unsustainable at scale. To address this challenge, we propose AgentTrek, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories by leveraging web tutorials. Our method automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and employs a visual-language model agent to simulate their execution in a real digital environment. A VLM-based evaluator ensures the correctness of the generated trajectories. We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over the current models. Moreover, our approach is more cost-efficient compared to traditional human annotation methods. This work underscores the potential of guided replay with web tutorials as a viable strategy for large-scale GUI agent training, paving the way for more capable and autonomous digital agents.
