CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Bechard, Spandana Gella, Sai Rajeswar

2026-03-26

Summary

This paper introduces CUA-Suite, a large new dataset designed to help build AI programs, called computer-use agents, that can operate a computer the way a person does, automating tasks on the desktop.

What's the problem?

Currently, creating these 'computer-use agents' is difficult because there isn't enough good training data. Existing datasets mostly have just screenshots, which don't show *how* a person actually uses the computer – the movements of the mouse, the timing of clicks, and the overall flow of actions. The largest existing dataset is too small, containing only about 20 hours of video, which isn't enough for these agents to learn effectively.

What's the solution?

The researchers created CUA-Suite, which includes a large collection of videos showing people performing tasks on computers. Its core dataset, called VideoCUA, has around 55 hours of footage covering roughly 10,000 tasks across 87 different programs. Importantly, it captures the full screen recording, the mouse movements, and detailed notes about the reasoning behind each action. They also included two other resources: UI-Vision, a benchmark for testing how well agents understand and plan from what they see on screen, and GroundCUA, a large dataset for locating and identifying user-interface elements.
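As a rough illustration of what one such demonstration might contain, here is a minimal sketch in Python. All field names here are hypothetical, invented for illustration; the paper does not specify the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CursorSample:
    # One sample from the 30 fps kinematic cursor trace (hypothetical layout)
    t: float             # timestamp in seconds
    x: int               # screen x coordinate
    y: int               # screen y coordinate
    event: str = "move"  # e.g. "move" or "click"

@dataclass
class Demonstration:
    # A single human-demonstrated task (hypothetical schema)
    task_id: str
    application: str                 # one of the 87 covered applications
    video_path: str                  # continuous 30 fps screen recording
    cursor_trace: list = field(default_factory=list)
    reasoning_notes: list = field(default_factory=list)

demo = Demonstration(
    task_id="demo-0001",
    application="ExampleEditor",
    video_path="recordings/demo-0001.mp4",
    cursor_trace=[
        CursorSample(0.0, 100, 200),
        CursorSample(0.5, 340, 410, "click"),
    ],
    reasoning_notes=["Open the file menu", "Click 'Export'"],
)
print(len(demo.cursor_trace))  # → 2
```

The point of the sketch is that each task bundles three aligned streams: the video itself, a dense cursor trace, and free-text reasoning notes.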

Why it matters?

This work is important because it provides the data needed to significantly improve computer-use agents. Initial tests show that current AI models struggle with real-world desktop applications, failing at tasks about 60% of the time. CUA-Suite will allow researchers to build more capable agents that can automate complex workflows, and it opens up new research directions such as better screen understanding, learning reward signals from video, and visual world models.

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
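The abstract's claim that continuous recordings form a superset of sparse formats can be illustrated with a minimal sketch. The data layout below (tuples of timestamp, coordinates, and event type) is an assumption for illustration, not the paper's actual format: filtering a dense 30 fps cursor trace down to click events recovers the screenshot-plus-click-coordinates representation of earlier datasets, while the reverse reconstruction is impossible.

```python
# Minimal sketch (hypothetical data format): reduce a dense 30 fps cursor
# trace to the sparse click-only format used by screenshot-based datasets.
def to_sparse_clicks(trace):
    """trace: list of (timestamp, x, y, event) tuples sampled at 30 fps."""
    return [(t, x, y) for (t, x, y, event) in trace if event == "click"]

dense_trace = [
    (0.000, 100, 200, "move"),
    (0.033, 120, 210, "move"),
    (0.066, 140, 220, "click"),   # user clicks a menu
    (0.100, 140, 220, "move"),
    (0.133, 300, 400, "click"),   # user clicks a menu item
]

print(to_sparse_clicks(dense_trace))
# → [(0.066, 140, 220), (0.133, 300, 400)]
```

The intermediate "move" samples discarded here are exactly the temporal dynamics (cursor kinematics, timing, hesitation) that sparse datasets lose and that VideoCUA preserves.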