
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi

2025-12-29


Summary

This paper introduces MAI-UI, a family of AI agents designed to operate computer interfaces such as the apps on your phone. The agents come in several sizes, from small models that can run on a device to much larger ones, and they aim to make interacting with computers more intuitive and natural.

What's the problem?

Currently, building agents like this is really hard. The biggest issues are that they can't hold a natural back-and-forth with the user, they can only act through what they see on the screen and can't call outside tools, there's no practical way to deploy them in the real world, and they often break when the interface changes or things aren't exactly as expected. In short, existing agents aren't very reliable or user-friendly.

What's the solution?

The researchers tackled these problems with a self-improving data pipeline, so the agents learn not only to navigate screens but also to talk with the user and call external tools (via MCP). They also built a device-cloud collaboration system that splits each task between your phone and a more powerful cloud model, depending on the state of the task. Finally, they trained the agents with reinforcement learning, a trial-and-error style of learning, and sped it up by running hundreds of simulated environments in parallel.
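To make the device-cloud idea concrete, here is a minimal sketch assuming a simple rule-based router; TaskState, route_step, and the thresholds below are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of routing a GUI step between an on-device model and a
# cloud model based on the current task state. All names and thresholds here
# are illustrative assumptions, not taken from the MAI-UI report.

from dataclasses import dataclass


@dataclass
class TaskState:
    step_count: int              # how many UI actions have been taken so far
    needs_long_reasoning: bool   # e.g. multi-app planning was detected
    contains_private_data: bool  # the current screen shows sensitive content


def route_step(state: TaskState) -> str:
    """Decide which model should produce the next GUI action."""
    # Keep screens with private data on the device to preserve privacy.
    if state.contains_private_data:
        return "on_device_model"
    # Escalate to the cloud model only when the task looks hard or long.
    if state.needs_long_reasoning or state.step_count > 20:
        return "cloud_model"
    # Default: handle routine steps locally and avoid a cloud call.
    return "on_device_model"


if __name__ == "__main__":
    state = TaskState(step_count=3, needs_long_reasoning=False,
                      contains_private_data=True)
    print(route_step(state))  # -> on_device_model
```

Keeping routine or sensitive steps on the device is what lets the report claim fewer cloud model calls while preserving user privacy.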

Why it matters?

This work is important because it significantly improves how well agents can understand and operate computer interfaces. MAI-UI outperforms existing agents on several benchmarks, meaning it is better at tasks like finding the right element on a screen (grounding) and completing multi-step tasks across apps (navigation). This could lead to a future where computers are much easier to use and agents can handle everyday tasks for us automatically.

Abstract

The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes a new state of the art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro, and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.
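As a rough illustration of the parallel-environment scaling mentioned in the abstract, here is a minimal sketch assuming simulated environments that can be stepped independently; run_episode and collect_parallel_rollouts are hypothetical names, and the toy reward stands in for real GUI observations and actions.

```python
# Minimal sketch of collecting RL rollouts from many simulated GUI environments
# at once. All names here are illustrative assumptions; the random reward is a
# placeholder for real screen observations, actions, and task rewards.

import concurrent.futures
import random


def run_episode(env_id: int, max_steps: int = 50) -> float:
    """Pretend rollout in one simulated environment; returns a toy episode reward."""
    total_reward = 0.0
    for _ in range(max_steps):
        total_reward += random.random()  # placeholder for one (action, reward) step
    return total_reward


def collect_parallel_rollouts(num_envs: int, max_steps: int = 50) -> list[float]:
    """Run num_envs episodes concurrently so the learner sees more data per second."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_envs) as pool:
        futures = [pool.submit(run_episode, i, max_steps) for i in range(num_envs)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # The report scales from 32 to 512 parallel environments; 32 is used here.
    rewards = collect_parallel_rollouts(num_envs=32)
    print(f"collected {len(rewards)} rollouts, "
          f"mean reward {sum(rewards) / len(rewards):.2f}")
```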