UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan
2025-10-21
Summary
This paper introduces UltraCUA, a foundation model for computer-use agents that operates computers more effectively by combining simple on-screen actions, like clicking and typing, with the higher-level programmatic commands usually available only to programmers.
What's the problem?
Current AI agents that operate computers are limited to basic actions like clicking and typing. This makes them slow and error-prone: tasks require long chains of steps, and a single misplaced click can derail the whole process. These agents also can't take advantage of the programmatic interfaces (APIs, tools, scripts) that software already exposes, which leaves them inefficient and unreliable on complex tasks.
What's the solution?
The researchers created UltraCUA, which can both click and type *and* call programmatic tools; in effect, it can give the computer instructions the way a programmer would. They built it by automatically collecting and adapting existing software tools, generating a large set of verifiable practice tasks, and then training the model in two stages: first by showing it example trajectories (supervised fine-tuning), then by letting it learn through trial and error (online reinforcement learning). This lets it strategically choose between simple GUI actions and more powerful programmatic calls at each step, as sketched below.
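The hybrid action idea can be pictured as an action space with two branches: low-level GUI primitives and high-level programmatic tool calls, and the agent picks one branch at each step. The following is a minimal sketch of such an interface; the class and function names are our own illustrative assumptions, not UltraCUA's actual implementation.

```python
# Minimal sketch of a hybrid action space (illustrative assumptions only,
# not UltraCUA's real API).
from dataclasses import dataclass, field
from typing import Union


@dataclass
class GuiAction:
    """A low-level GUI primitive, e.g. click, type, or scroll."""
    kind: str            # "click", "type", "scroll", ...
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class ToolCall:
    """A high-level programmatic tool call, e.g. an API or script invocation."""
    tool: str                              # e.g. "files.search" (hypothetical name)
    arguments: dict = field(default_factory=dict)


HybridAction = Union[GuiAction, ToolCall]


def execute(action: HybridAction) -> str:
    """Dispatch a hybrid action to the appropriate executor (stubbed here)."""
    if isinstance(action, GuiAction):
        # A GUI backend would perform the primitive on screen.
        return f"GUI: {action.kind} at ({action.x}, {action.y}) text={action.text!r}"
    # A tool backend (e.g. an API or MCP-style server) would run the call.
    return f"TOOL: {action.tool}({action.arguments})"


if __name__ == "__main__":
    # The agent can interleave both action types within one trajectory.
    print(execute(GuiAction(kind="click", x=420, y=310)))
    print(execute(ToolCall(tool="files.search", arguments={"query": "report.pdf"})))
```

The benefit of the second branch is that a single tool call can replace a long, fragile chain of clicks, which is what reduces error propagation and step count.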
Why it matters?
This work matters because it makes computer-use agents markedly more capable: UltraCUA completes tasks more reliably and in fewer steps than previous agents, and it generalizes to environments it wasn't trained on, such as Windows. This could lead to AI assistants that genuinely help people with complex computer tasks, automating work that is currently difficult or impossible for AI to do.
Abstract
Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.
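The "verifiable tasks" in the abstract are tasks whose success can be checked programmatically, which is what makes the online reinforcement learning stage possible: the checker supplies the reward signal. Below is a minimal sketch of that idea under our own assumptions about the task format; it is not the paper's actual data schema.

```python
# Minimal sketch of a verifiable task: an instruction paired with a
# programmatic checker that yields a binary reward for online RL.
# Field names and the example task are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable


@dataclass
class VerifiableTask:
    instruction: str                      # natural-language goal for the agent
    verifier: Callable[[dict], bool]      # checks the final environment state


def reward(task: VerifiableTask, final_state: dict) -> float:
    """Binary reward: 1.0 if the verifier accepts the final state, else 0.0."""
    return 1.0 if task.verifier(final_state) else 0.0


# Example: success is verified by inspecting a (mocked) environment state.
task = VerifiableTask(
    instruction="Rename the file draft.txt to final.txt",
    verifier=lambda s: "final.txt" in s.get("files", [])
    and "draft.txt" not in s.get("files", []),
)

print(reward(task, {"files": ["final.txt", "notes.md"]}))   # 1.0 (success)
print(reward(task, {"files": ["draft.txt", "notes.md"]}))   # 0.0 (failure)
```

Because the reward comes from the environment state rather than a human label, such tasks can be generated and graded at the scale the abstract describes (over 17,000).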