UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li
2025-01-22
Summary
This paper introduces UI-TARS, a new AI system that can interact with computer screens and apps much as a human would. It is designed to understand what it sees on the screen and to perform tasks by clicking, typing, and navigating through different programs, without needing step-by-step instructions from humans.
What's the problem?
Current AI systems that try to interact with computer interfaces often rely on complicated setups and need experts to give them specific instructions. They also struggle to adapt when faced with new situations or changes in the interface. This makes them less flexible and harder to use in real-world scenarios where computer screens and apps can vary a lot.
What's the solution?
The researchers created UI-TARS, which uses several clever techniques to solve these problems. First, it learns to understand what it sees on screen by studying lots of screenshots. Then, it figures out how to interact with different apps using a standard set of actions that work across various devices. UI-TARS also uses advanced reasoning to break down complex tasks into smaller steps and learn from its mistakes. Finally, it keeps improving itself by practicing on virtual computers and learning from its experiences.
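One of the ideas above is a "standard set of actions that work across various devices": instead of emitting platform-specific commands, the agent outputs actions from one shared vocabulary, with screen positions expressed so they transfer between desktop, web, and mobile screens. The sketch below illustrates that idea in Python; the class names, fields, and normalized-coordinate convention are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ActionType(Enum):
    """A small, platform-agnostic action vocabulary (illustrative)."""
    CLICK = "click"
    TYPE = "type"
    SCROLL = "scroll"
    DRAG = "drag"
    FINISHED = "finished"


@dataclass
class Action:
    """One step in a unified action space.

    Coordinates are normalized to [0, 1], so the same action applies to
    screenshots of any resolution; a runner maps them to real pixels.
    (This schema is a stand-in for whatever UI-TARS actually uses.)
    """
    kind: ActionType
    x: Optional[float] = None   # normalized horizontal target
    y: Optional[float] = None   # normalized vertical target
    text: Optional[str] = None  # payload for TYPE actions


def to_pixels(action: Action, width: int, height: int) -> Tuple[int, int]:
    """Map a normalized action target onto a concrete screen size."""
    return round(action.x * width), round(action.y * height)


# Example: the same "click the center" action works on any screen.
center_click = Action(ActionType.CLICK, x=0.5, y=0.5)
print(to_pixels(center_click, 1920, 1080))  # -> (960, 540)
print(to_pixels(center_click, 1170, 2532))  # -> (585, 1266)
```

Keeping the action space small and device-independent is what lets one model drive many platforms: only the final normalized-to-pixel mapping changes per device.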
Why it matters?
This matters because it could make computers and apps much easier to use for everyone. Imagine having an AI assistant that could handle complicated tasks on your computer or phone, like booking flights or filling out forms, without you having to explain every little step. It could save people a lot of time and make technology more accessible to those who struggle with complex interfaces. For businesses, it could automate many tasks that currently require human workers, potentially increasing efficiency and reducing costs. UI-TARS represents a big step forward in creating AI that can truly understand and interact with the digital world in a human-like way.
Abstract
This paper introduces UI-TARS, a native GUI agent model that solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflective thinking, and milestone recognition; and (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
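The fourth innovation, iterative training with reflective online traces, amounts to a data flywheel: roll the agent out on many virtual machines, keep successful traces, reflectively repair failed ones, and fine-tune on the result before the next round. The toy loop below sketches that control flow only; the function names, the stand-in "skill" number, and the success model are all made-up simplifications, not the paper's training procedure.

```python
import random

random.seed(0)  # deterministic toy rollouts


def run_episode(policy_skill: float) -> tuple:
    """Toy rollout standing in for running the agent on a VM.

    Returns (trace, success): the trace is a list of action labels and
    success is stochastic in the policy's current skill. (Hypothetical.)
    """
    trace = [f"step_{i}" for i in range(3)]
    return trace, random.random() < policy_skill


def reflect(trace: list) -> list:
    """Toy 'reflection': append a correction to a failed trace,
    standing in for reflectively refining its erroneous steps."""
    return trace + ["corrected_step"]


def iterate(policy_skill: float, rounds: int, episodes: int) -> float:
    """One data flywheel: collect, filter/refine, then 'fine-tune'.

    Because reflection rescues failed traces, every episode yields
    usable training data; here fine-tuning is faked as a fixed skill
    bump per fully usable batch (an illustrative stand-in).
    """
    for _ in range(rounds):
        dataset = []
        for _ in range(episodes):
            trace, ok = run_episode(policy_skill)
            dataset.append(trace if ok else reflect(trace))
        policy_skill = min(1.0, policy_skill + 0.05 * len(dataset) / episodes)
    return policy_skill


print(iterate(policy_skill=0.2, rounds=3, episodes=10))  # -> 0.35...
```

The point of the sketch is the shape of the loop: reflection converts failures into supervision, so each round's dataset stays full-sized and the agent can improve with minimal human labeling.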