Step-GUI Technical Report

Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong

2025-12-18

Summary

This paper focuses on using artificial intelligence to control computers, specifically by teaching models to interact with graphical user interfaces (GUIs) like those on your phone or computer. It introduces new methods for training these AI systems and for making sure they can perform tasks privately and reliably.

What's the problem?

Training AI to use GUIs effectively is hard because it requires large amounts of labeled data showing the AI exactly what to do. Collecting this data is expensive, and the annotations can be unreliable unless humans carefully check every step. And as these AI systems improve, they also need to work across different devices and protect your personal information while doing so.

What's the solution?

The researchers developed a system called 'Calibrated Step Reward System' that lets the AI learn from its *own* attempts at using GUIs, automatically checking and correcting its work to create high-quality training data at a much lower cost. They then built a family of AI models, 'Step-GUI', that perform very well on GUI tasks. To address privacy concerns, they created 'GUI-MCP', a system that allows the AI to handle sensitive tasks directly on your device without sending your data to the cloud. Finally, they created a new benchmark called 'AndroidDaily' to test the AI on realistic, everyday phone tasks.
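To make the training-pipeline idea more concrete, here is a minimal Python sketch of what such a self-improvement loop could look like: the model attempts tasks, each step gets a calibrated score, and only high-confidence steps are kept as new training data. All names (`Step`, `collect_trajectory`, `calibrate`, `self_evolve`) are hypothetical illustrations, not the paper's actual code, and the model-specific parts are left as stubs.

```python
# Hypothetical sketch of a self-evolving GUI training loop with calibrated
# step rewards. Names are illustrative, not taken from the paper.
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    screenshot: bytes      # GUI state before the action
    action: str            # e.g. "tap(320, 540)" or "type('hello')"
    reward: float = 0.0    # calibrated per-step score in [0, 1]


def collect_trajectory(model, task: str) -> List[Step]:
    """Let the current model attempt a GUI task and record every step."""
    ...


def calibrate(trajectory: List[Step], task: str) -> List[Step]:
    """Score each step, then adjust the scores using the whole trajectory
    (e.g. whether the final screen shows the task was actually completed)."""
    ...


def self_evolve(model, tasks: List[str], threshold: float = 0.9):
    """One round of self-training: keep only well-calibrated steps."""
    dataset = []
    for task in tasks:
        trajectory = collect_trajectory(model, task)
        for step in calibrate(trajectory, task):
            if step.reward >= threshold:           # cheap automatic filter
                dataset.append((step.screenshot, step.action))
    model.finetune(dataset)                        # train the next, stronger model
    return model
```

The point of the trajectory-level calibration is that checking whether a whole attempt succeeded is much cheaper than having humans label every click, which is how the paper reports high annotation accuracy at a fraction of the cost.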

Why it matters?

This work is important because it makes it more practical to build AI assistants that can automate tasks on your devices, like managing apps or completing actions within programs. By reducing the cost of training data and prioritizing privacy, it brings us closer to having helpful AI tools that can seamlessly integrate into our daily lives without compromising our security.

Abstract

Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenSpot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
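The abstract's description of GUI-MCP suggests a two-level interface: fine-grained atomic operations plus a tool that hands entire tasks to a local specialist model, so sensitive screen content never has to leave the device. The Python sketch below illustrates that split under those assumptions; `GUIAutomationServer`, `LocalSpecialist`, and `delegate_task` are hypothetical names for illustration, not the actual GUI-MCP API.

```python
# Hypothetical sketch of a hierarchical GUI automation server in the spirit
# of GUI-MCP. Identifiers are illustrative, not the paper's actual interface.


class LocalSpecialist:
    """Stand-in for a small on-device GUI model (e.g. a local checkpoint)."""

    def run(self, instruction: str) -> str:
        # The local model plans and executes the GUI steps itself;
        # only a short status string is returned to the caller.
        ...


class GUIAutomationServer:
    def __init__(self, specialist: LocalSpecialist):
        self.specialist = specialist

    # --- low-level atomic operations (fine-grained control) ---
    def tap(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...
    def screenshot(self) -> bytes: ...

    # --- high-level task delegation (privacy-preserving path) ---
    def delegate_task(self, instruction: str) -> str:
        """Hand a whole task (e.g. 'reply to the latest message') to the
        local specialist; screenshots and typed text stay on-device."""
        return self.specialist.run(instruction)
```

In such an arrangement, a remote planner could still drive simple, non-sensitive steps through the atomic operations while routing anything involving personal data through the delegation path, which is the privacy property the abstract emphasizes.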