MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang
2025-07-28
Summary
This paper introduces MMBench-GUI, a benchmark for testing how well AI agents can operate graphical user interfaces (GUIs) across multiple platforms, including Windows, macOS, Android, and the web, by assessing their skills at several hierarchical levels.
What's the problem?
Existing benchmarks usually evaluate only one facet of what a GUI agent must do, such as locating on-screen elements or completing tasks, without capturing how these different skills build on one another, and they rarely measure how efficiently agents operate.
What's the solution?
The researchers built a hierarchical benchmark with four levels that test everything from understanding GUI content to collaborating on complex tasks. They also introduced a metric called Efficiency-Quality Area (EQA) that captures how well agents complete tasks relative to the effort they expend, and they evaluated agents on many tasks across multiple real platforms.
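To make the efficiency-quality idea concrete, here is a minimal sketch of one plausible way such a metric could be computed: as the area under a quality-versus-normalized-steps curve, so an agent that reaches high quality in fewer steps scores higher. Note that the function name, the trapezoidal formulation, and the normalization by a step budget are all illustrative assumptions, not the paper's actual definition of Efficiency-Quality Area.

```python
def efficiency_quality_area(qualities, max_steps):
    """Hypothetical EQA-style score (illustrative only, not the paper's formula).

    qualities[i] is the task-quality score in [0, 1] after step i + 1.
    Steps are normalized by a budget max_steps, and the area under the
    resulting quality curve is accumulated with the trapezoid rule.
    """
    area = 0.0
    prev_x, prev_q = 0.0, 0.0  # start at zero quality before any step
    for i, q in enumerate(qualities):
        x = (i + 1) / max_steps  # normalized step count in [0, 1]
        area += (x - prev_x) * (prev_q + q) / 2.0  # trapezoid slice
        prev_x, prev_q = x, q
    return area

# An agent that solves the task immediately scores higher than one
# that only succeeds on its final allowed step.
fast = efficiency_quality_area([1.0, 1.0, 1.0, 1.0], max_steps=4)
slow = efficiency_quality_area([0.0, 0.0, 0.0, 1.0], max_steps=4)
```

Under this sketch, `fast` evaluates to 0.875 and `slow` to 0.125, so the area rewards agents that achieve quality early rather than only at the end of the step budget.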
Why it matters?
This matters because it shows developers exactly where AI agents fall short in real-world GUI operation, guiding work toward automation that is smarter, more reliable, and more efficient across different devices and systems.
Abstract
MMBench-GUI evaluates GUI automation agents across multiple platforms using a hierarchical benchmark and Efficiency-Quality Area metric, highlighting the importance of visual grounding, task planning, and efficiency.