UItron: Foundational GUI Agent with Advanced Perception and Planning

Zhixiong Zeng, Jing Huang, Liming Zheng, Wenkang Han, Yufeng Zhong, Lei Chen, Longrong Yang, Yingjie Chu, Yuzhi He, Lin Ma

2025-09-01

Summary

This paper introduces UItron, a new foundational model designed to automate tasks on computer and mobile device interfaces, essentially creating an AI that can use apps like a person.

What's the problem?

Building AI agents that can reliably use graphical user interfaces (GUIs) – clicking buttons, filling out forms, and so on – is really hard. There are few recorded examples (operation trajectories) of how to do these tasks, it's difficult to set up environments where the AI can actually *practice* interacting with devices, and existing AI models aren't naturally good at this kind of work. On top of that, most current solutions don't work well with Chinese apps, which are enormously popular.

What's the solution?

The researchers created UItron by first improving how the AI 'sees' and understands what's on the screen, and then teaching it how to plan a series of actions to achieve a goal. They used a two-step training process: first, they showed UItron lots of examples of correct actions, and then they let it learn through trial and error in a simulated environment. Crucially, they also built a system to connect the AI to both phones and computers for testing. They also collected a huge dataset of over a million actions taken within popular Chinese apps to specifically improve performance in that area.
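The two-step recipe described above – imitating demonstrated actions first, then refining by trial and error on progressively harder tasks – can be sketched in miniature. This is an illustrative toy only, not the paper's implementation: the policy is a simple state-to-action table, and all function and task names are assumptions.

```python
import random

def supervised_finetune(policy, demonstrations):
    """Stage 1 (illustrative): copy each demonstrated state -> action
    mapping into the policy, imitating the expert trajectories."""
    for state, action in demonstrations:
        policy[state] = action
    return policy

def curriculum_rl(policy, tasks, episodes_per_task=50, seed=0):
    """Stage 2 (illustrative): practice tasks ordered from easy to hard,
    trying actions and keeping the ones that earn a positive reward."""
    rng = random.Random(seed)
    for task in sorted(tasks, key=lambda t: t["difficulty"]):
        for _ in range(episodes_per_task):
            # Explore randomly until a rewarded action is found, then exploit it.
            action = policy.get(task["state"], rng.choice(task["actions"]))
            if task["reward"](action) > 0:
                policy[task["state"]] = action
    return policy

# Hypothetical usage: one demonstrated step, then one practice task.
policy = supervised_finetune({}, [("login_screen", "tap_login")])
tasks = [{
    "state": "search_bar",
    "actions": ["type_query", "tap_back"],
    "difficulty": 1,
    "reward": lambda a: 1 if a == "type_query" else 0,
}]
policy = curriculum_rl(policy, tasks)
```

The real system replaces the lookup table with a vision-language model, the reward check with rollouts in the Mobile/PC interactive environment, and the difficulty ordering with the paper's curriculum schedule – but the control flow is the same shape.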

Why it matters?

This work is important because creating AI agents that can use devices automatically is a key step towards more advanced AI – the kind that could eventually handle a wide range of tasks without human help. UItron’s success, especially with Chinese apps, shows that it’s possible to build these agents and that focusing on good data and interactive testing environments is essential for making them truly useful in the real world.

Abstract

GUI agents aim to enable automated operations on Mobile/PC devices, an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains a challenging task due to the scarcity of operation trajectories, the limited availability of interactive infrastructure, and the limitations of the initial capabilities of foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systematic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to enhance training effects, but also establishes an interactive environment connecting both Mobile and PC devices. In training, UItron adopts supervised finetuning over perception and planning tasks in various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration in online environments. As a result, UItron achieves superior performance on benchmarks of GUI perception, grounding, and planning. In particular, UItron highlights interaction proficiency with top-tier Chinese mobile apps, as we identified a general lack of Chinese capabilities even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real-world application.