
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Bo Li, Chen Dun, Chong Liu, Fuxing Leng, Hanbin Wang, Hao Yu

2025-09-03

Summary

This paper introduces UI-TARS-2, a new artificial intelligence agent designed to operate computer interfaces the way a person does, with a mouse and keyboard. It is a significant improvement over previous versions and shows real promise for automating tasks on computers.

What's the problem?

Creating AI agents that can reliably use computer interfaces is hard. Existing agents struggle in several ways: they need huge amounts of training data, they have trouble remembering what they have done across multiple steps, they can only interact with the screen itself, and the environments they learn in are often unstable. In short, they are not very good at complex, real-world tasks that require a series of actions and adapting to change.

What's the solution?

The researchers tackled these problems with a few key ideas. First, they built a "data flywheel" that automatically generates large amounts of training data. Second, they stabilized how the agent learns over multiple steps, using multi-turn reinforcement learning. Third, they expanded the agent's environment beyond the graphical interface to include the file system and a command-line terminal. Finally, they built a unified sandbox platform for testing the agent at large scale. They then evaluated UI-TARS-2 on a range of tasks, including navigating websites, using operating systems, playing games, and doing basic software engineering.
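To make the "multi-turn" and "hybrid environment" ideas concrete, here is a minimal sketch of what an agent-environment interaction loop of this kind could look like. This is not the UI-TARS-2 implementation: the names `HybridEnv`, `Action`, `RandomAgent`, and `rollout` are hypothetical and only illustrate how GUI, file-system, and terminal actions can share one loop whose trajectories feed a reinforcement-learning update.

```python
# Illustrative sketch only; all class and function names here are invented,
# not taken from the UI-TARS-2 report or codebase.
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str                                     # e.g. "click", "type", "shell", "read_file"
    payload: dict = field(default_factory=dict)   # e.g. {"x": 120, "y": 300} or {"cmd": "ls"}


class HybridEnv:
    """Toy environment exposing a screen plus file-system and terminal access."""

    def reset(self) -> dict:
        # Initial observation: screenshot bytes, visible files, terminal output.
        return {"screenshot": b"", "files": {}, "terminal": ""}

    def step(self, action: Action):
        # A real environment would execute the click/keystroke/command here and
        # report task success; this stub just returns an empty observation.
        obs = {"screenshot": b"", "files": {}, "terminal": ""}
        reward, done = 0.0, False
        return obs, reward, done


def rollout(agent, env: HybridEnv, max_turns: int = 50) -> list:
    """Collect one multi-turn trajectory that an RL algorithm could learn from."""
    trajectory, obs = [], env.reset()
    for _ in range(max_turns):
        action = agent.act(obs)                   # agent picks a GUI or terminal action
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))  # stored for the multi-turn RL update
        obs = next_obs
        if done:
            break
    return trajectory


class RandomAgent:
    """Placeholder policy that always issues the same shell command."""

    def act(self, obs) -> Action:
        return Action(kind="shell", payload={"cmd": "ls"})


print(len(rollout(RandomAgent(), HybridEnv())))   # 50 turns of the toy loop
```

The point of the sketch is the structure: because the environment mixes screen actions with file and terminal access, a single trajectory can span many turns, which is exactly the setting the stabilized multi-turn RL training is meant to handle.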

Why it matters?

This work is important because it represents a big step forward in creating AI agents that can actually *use* computers like humans do. It could automate many everyday tasks, make computers easier to use, and even help with complex jobs like software development. The agent performs very well compared to other AI systems and approaches human-level performance on some tasks, showing its potential for real-world applications.

Abstract

The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.
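As a rough illustration of how a "mean normalized score" of this kind might be computed: the abstract does not give a formula, so normalizing each game against a human reference score is an assumption based on the "roughly 60% of human-level performance" phrasing, and the per-game numbers below are invented.

```python
# Assumed metric: each game score is expressed as a percentage of the human
# score for that game, then averaged across the suite. All numbers are made up.
def mean_normalized_score(agent_scores, human_scores):
    """Average per-game score as a percentage of human performance."""
    return sum(100.0 * a / h for a, h in zip(agent_scores, human_scores)) / len(agent_scores)


# Example: an agent at roughly 60% of human level on three toy games.
print(mean_normalized_score([30, 12, 90], [50, 20, 150]))  # 60.0
```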