VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu
2025-10-23
Summary
This paper introduces a new way to create training data for computer agents that learn to use software, by automatically extracting information from existing online tutorial videos.
What's the problem?
Teaching a computer to use software (clicking buttons, typing text, and so on) requires a large amount of example data showing it what to do. Collecting this data manually is slow and expensive because someone has to watch recordings and label every single action the computer should take.
What's the solution?
The researchers developed a system called VideoAgentTrek that automatically finds and labels actions within screen-recorded videos. It works in two main steps: first, it identifies *where* and *when* actions happen in the video, and second, it figures out *what* those actions are, like the exact coordinates of a click or the text that was typed. They used this to create a huge dataset of over a million actions from YouTube tutorials.
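To make the two steps concrete, each detected action can be thought of as a small structured record: the grounding step supplies the temporal boundaries, and the recognizer fills in the action type and its parameters. The sketch below is illustrative only; the field names and serialization format are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ActionStep:
    """Hypothetical record for one detected GUI action (illustrative schema)."""
    start_s: float                             # when the action begins in the video
    end_s: float                               # when it ends (temporal boundary)
    action: str                                # e.g. "click", "type", "scroll"
    coords: Optional[Tuple[int, int]] = None   # screen coordinates for clicks
    text: Optional[str] = None                 # typed content, if any

def to_training_line(step: ActionStep) -> str:
    """Serialize a step into a flat text action, as training data might require."""
    if step.action == "click" and step.coords is not None:
        return f"click({step.coords[0]}, {step.coords[1]})"
    if step.action == "type" and step.text is not None:
        return f'type("{step.text}")'
    return step.action

# A two-step snippet of a mined trajectory (made-up values):
steps = [
    ActionStep(3.2, 3.5, "click", coords=(412, 87)),
    ActionStep(4.0, 6.1, "type", text="quarterly report"),
]
print([to_training_line(s) for s in steps])
# → ['click(412, 87)', 'type("quarterly report")']
```

Run over ~39,000 videos, records like these would accumulate into the 1.52 million interaction steps the paper reports.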
Why it matters?
This research matters because it offers a way to build better computer agents without paying people to hand-label huge amounts of data. By mining freely available videos, the authors significantly improved their agents' performance, making them more capable of automating computer-based tasks.
Abstract
Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.