Grounding Computer Use Agents on Human Demonstrations
Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar
2025-11-12
Summary
This paper introduces a new dataset called GroundCUA designed to help computers better understand and follow our instructions when we're using programs on a desktop computer, like clicking buttons or filling out forms.
What's the problem?
Currently, it's hard to build computer programs that can reliably act as assistants on our computers because there isn't enough good data to train them. Existing datasets focus on websites or phones, but there's a lack of high-quality examples showing how to interact with typical desktop applications like Word, Excel, or Photoshop. This makes it difficult for these programs to accurately connect what we *say* we want to do with the actual buttons and options on the screen.
What's the solution?
The researchers created GroundCUA, a large collection of 56,000 screenshots from 87 different desktop programs, covering 12 different types of applications. They had people carefully label every on-screen element in these screenshots – over 3.56 million human-verified labels in total! They then used these labels to create a wide variety of instructions that people might give to a computer assistant. Finally, they used this data to train a new family of models called GroundNext, which are really good at figuring out which part of the screen a user is referring to when they give an instruction. These models achieve state-of-the-art results on five benchmarks while using less than one-tenth the training data of earlier approaches, and get even better with a little extra refinement through reinforcement learning.
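To make the grounding task concrete, here is a toy sketch in Python. The element labels, data format, and word-overlap matcher are invented for illustration only – real models like GroundNext predict the target location directly from the screenshot pixels – but the input/output shape is the same: an instruction plus a screen goes in, a click point comes out.

```python
# Toy illustration of UI grounding: given one screenshot's element
# annotations and a natural-language instruction, return a click point.
# The labels and the simple word-overlap matcher are hypothetical; they
# only mimic the task format, not how GroundNext actually works.

def ground(instruction, elements):
    """Pick the element whose label shares the most words with the
    instruction, and return the center of its bounding box."""
    words = set(instruction.lower().split())

    def overlap(el):
        return len(words & set(el["label"].lower().split()))

    best = max(elements, key=overlap)
    x1, y1, x2, y2 = best["bbox"]
    return ((x1 + x2) // 2, (y1 + y2) // 2)

# Hypothetical annotations for one screenshot (label + pixel bbox),
# echoing the "every on-screen element labeled" idea behind GroundCUA.
elements = [
    {"label": "Save button", "bbox": (10, 5, 60, 25)},
    {"label": "Font size dropdown", "bbox": (80, 5, 140, 25)},
    {"label": "Bold toggle", "bbox": (150, 5, 170, 25)},
]

print(ground("click the save button", elements))  # -> (35, 15)
```

A trained grounding model replaces the word-overlap heuristic with a vision-language model, which is what lets it handle icons, ambiguous labels, and elements whose text never appears in the instruction.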
Why it matters?
This work is important because it provides a crucial resource for building more helpful and reliable computer assistants. By having a high-quality dataset specifically for desktop applications, researchers can create programs that can actually understand and carry out our instructions, making computers easier and more intuitive to use. It shows that having good, carefully created data is more important than just having a lot of data.
Abstract
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.