Mobile-Agent-v3: Foundational Agents for GUI Automation
Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan
2025-08-22
Summary
This paper introduces GUI-Owl and Mobile-Agent-v3, new AI models designed to interact with computer and phone interfaces the way a human would; they outperform other publicly available open-source models.
What's the problem?
Creating AI that can reliably use graphical user interfaces (GUIs) – things like apps on your phone or programs on your computer – is really hard. Existing AI often struggles with understanding what’s on the screen, planning a series of actions to achieve a goal, and adapting to different environments. Also, getting enough training data to teach these AI agents is time-consuming and expensive because it usually requires people to manually label everything.
What's the solution?
The researchers tackled this problem in a few key ways. First, they built a huge, automated system that can create its own training data by virtually using apps on different operating systems (Android, Windows, macOS, Ubuntu). This system uses GUI-Owl to test itself and improve over time. Second, they designed GUI-Owl to be good at all the important parts of GUI interaction: understanding the screen, planning steps, knowing what actions do, and reasoning about what to do next. Finally, they developed a way to train the AI more efficiently using a technique called reinforcement learning, allowing it to learn from its successes and failures while interacting with the virtual environments. They then built Mobile-Agent-v3 on top of GUI-Owl to further boost performance.
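The self-improving data loop described above (generate a task, let the agent attempt it in a virtual environment, automatically validate the result, keep only good trajectories, retrain) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: every function name here (`generate_query`, `run_agent`, `judge_trajectory`, `fine_tune`) is a hypothetical placeholder.

```python
import random

def generate_query(app):
    """Hypothetical: propose a task for the agent to attempt in the given app."""
    return f"Complete a common task in {app}"

def run_agent(model, query):
    """Hypothetical rollout: a real system would execute clicks/typing
    in an emulator; here we fake a trajectory and a success flag."""
    steps = [f"step-{i}" for i in range(random.randint(2, 5))]
    return steps, random.random() > 0.5

def judge_trajectory(trajectory):
    """Stand-in for automated correctness validation (e.g., an LLM judge)."""
    return len(trajectory) > 0

def fine_tune(model, dataset):
    """Stand-in for training; tags the model with how many examples it saw."""
    return f"{model}+{len(dataset)}ex"

def self_evolve(model, apps, rounds=2):
    """Each round: collect validated trajectories, then retrain the model,
    so the next round's rollouts come from an improved agent."""
    dataset = []
    for _ in range(rounds):
        for app in apps:
            query = generate_query(app)
            trajectory, success = run_agent(model, query)
            if success and judge_trajectory(trajectory):
                dataset.append((query, trajectory))
        model = fine_tune(model, dataset)
    return model, dataset
```

The key design point is the loop itself: because the same agent both produces and is trained on the trajectories, data quality and agent capability improve together without manual labeling.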
Why it matters?
This work is important because it represents a significant step forward in creating AI assistants that can actually *use* software for us. Imagine an AI that can automatically fill out forms, book appointments, or perform complex tasks on your computer or phone. This technology could make computers much more accessible and automate a lot of tedious work. Because the code is publicly available, other researchers can build upon this work and accelerate progress in the field of AI-powered GUI interaction.
Abstract
This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
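The abstract names Trajectory-aware Relative Policy Optimization but gives no formula here. A common group-relative scheme for trajectory-level RL (as in GRPO-style methods) normalizes each rollout's scalar reward against the other rollouts for the same task and shares that advantage across the trajectory's steps. The sketch below illustrates that idea under those assumptions; the paper's exact objective may differ.

```python
from statistics import mean, pstdev

def trajectory_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each trajectory's reward
    against the group of rollouts for the same task.
    Hedged sketch, not the paper's exact TRPO objective."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_steps(advantage, num_steps):
    """Trajectory-aware credit assignment: every step in a trajectory
    shares the trajectory-level advantage."""
    return [advantage] * num_steps
```

With sparse success/failure rewards, this gives successful trajectories a positive advantage and failed ones a negative advantage within each task group, which is what lets long multi-step GUI episodes be trained with only an end-of-episode signal.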