MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang
2025-07-28
Summary
This paper introduces MMBench-GUI, a benchmark for testing how well AI agents can operate graphical user interfaces (GUIs) across multiple platforms, including Windows, macOS, Android, and the web, by assessing their skills at several hierarchical levels.
What's the problem?
Existing benchmarks usually evaluate only one facet of what a GUI agent must do, such as locating on-screen elements or completing tasks, without capturing how these different skills build on one another, and they rarely measure how efficiently agents operate.
What's the solution?
The researchers built a hierarchical benchmark with four levels that test everything from understanding GUI content to collaborating on complex tasks. They also introduced a metric called Efficiency-Quality Area (EQA) that captures how well agents complete tasks relative to the effort they expend, and they evaluated agents on many tasks across multiple real platforms.
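To make the efficiency-quality idea concrete, here is a minimal sketch of one plausible way such a metric could be computed: as the area under a quality-versus-normalized-steps curve, so an agent that reaches high quality in fewer steps scores higher. Note that the function name, the trapezoidal formulation, and the normalization by a step budget are all illustrative assumptions, not the paper's actual definition of Efficiency-Quality Area.

```python
def efficiency_quality_area(qualities, max_steps):
    """Hypothetical EQA-style score (illustrative only, not the paper's formula).

    qualities[i] is the task-quality score in [0, 1] after step i + 1.
    Steps are normalized by a budget max_steps, and the area under the
    resulting quality curve is accumulated with the trapezoid rule.
    """
    area = 0.0
    prev_x, prev_q = 0.0, 0.0  # start at zero quality before any step
    for i, q in enumerate(qualities):
        x = (i + 1) / max_steps  # normalized step count in [0, 1]
        area += (x - prev_x) * (prev_q + q) / 2.0  # trapezoid slice
        prev_x, prev_q = x, q
    return area

# An agent that solves the task immediately scores higher than one
# that only succeeds on its final allowed step.
fast = efficiency_quality_area([1.0, 1.0, 1.0, 1.0], max_steps=4)
slow = efficiency_quality_area([0.0, 0.0, 0.0, 1.0], max_steps=4)
```

Under this sketch, `fast` evaluates to 0.875 and `slow` to 0.125, so the area rewards agents that achieve quality early rather than only at the end of the step budget.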
Why it matters?
This matters because it shows developers exactly where AI agents fall short in real-world GUI operation, guiding work toward automation that is smarter, more reliable, and more efficient across different devices and systems.
Abstract
MMBench-GUI evaluates GUI automation agents across multiple platforms using a hierarchical benchmark and Efficiency-Quality Area metric, highlighting the importance of visual grounding, task planning, and efficiency.