MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents
Pengxiang Zhao, Guangyi Liu, Yaozhen Liang, Weiqing He, Zhengxi Lu, Yuehao Huang, Yaxuan Guo, Kexin Zhang, Hao Wang, Liang Liu, Yong Liu
2025-09-09
Summary
This paper introduces MAS-Bench, a new way to test and compare 'hybrid agents' – programs that interact with apps on your phone or computer using a mix of regular screen taps and faster, more direct methods like shortcuts and APIs.
What's the problem?
Until now, there was no good, standardized way to measure how well these hybrid agents perform. It's easy to test agents that *only* use screen taps, but harder to evaluate how well they can *learn* to use shortcuts to make things faster. Researchers needed a benchmark to fairly compare different approaches to building these smarter agents.
What's the solution?
The researchers created MAS-Bench, which includes 139 complex tasks across 11 real-world mobile apps. Every task can be completed by tapping the screen alone, but becomes much quicker if the agent figures out how to use shortcuts. MAS-Bench also provides a knowledge base of 88 predefined shortcuts (APIs, deep links, and RPA scripts) and 7 evaluation metrics, and it measures how well an agent can discover and create its own reusable shortcuts. The researchers then tested agents on this benchmark and found that hybrid agents achieved significantly higher success rates and efficiency than GUI-only agents.
Why does it matter?
MAS-Bench is important because it provides a common ground for researchers to develop and test better GUI agents. By having a standard benchmark, it’s easier to compare different techniques and ultimately build smarter, faster, and more helpful assistants for our phones and computers.
Abstract
To enhance the efficiency of GUI agents on various platforms like smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., API, deep links) is emerging as a promising direction. However, a framework for systematically benchmarking these hybrid agents is still underexplored. To take the first step in bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents with a specific focus on the mobile domain. Beyond merely using predefined shortcuts, MAS-Bench assesses an agent's capability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable via GUI-only operations, but can be significantly accelerated by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent's shortcut generation capabilities. MAS-Bench fills a critical evaluation gap, providing a foundational platform for future advancements in creating more efficient and robust intelligent agents.
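The hybrid idea described in the abstract, using a shortcut when one applies and falling back to ordinary GUI operations otherwise, can be illustrated with a minimal sketch. This is not MAS-Bench's code or interface: the `Shortcut` class, the keyword-matching rule, the example knowledge-base entries, and the function names are all hypothetical and chosen only to show the control flow.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Shortcut:
    """Hypothetical knowledge-base entry (not the benchmark's real schema)."""
    name: str                  # illustrative identifier
    kind: str                  # "api", "deep_link", or "rpa_script"
    keywords: Tuple[str, ...]  # words that must all appear in the subgoal


def choose_action(subgoal: str, knowledge_base: List[Shortcut]) -> Tuple[str, str]:
    """Return ("shortcut", name) if an entry matches, else ("gui", subgoal)."""
    words = set(subgoal.lower().split())
    for shortcut in knowledge_base:
        if all(keyword in words for keyword in shortcut.keywords):
            return ("shortcut", shortcut.name)
    # No shortcut applies, so the step is left to step-by-step GUI operations.
    return ("gui", subgoal)


def plan(subgoals: List[str], knowledge_base: List[Shortcut]) -> List[Tuple[str, str]]:
    """Map each subgoal to either a shortcut call or a GUI fallback."""
    return [choose_action(subgoal, knowledge_base) for subgoal in subgoals]


# Made-up entries for illustration only.
KB = [
    Shortcut("open_map_search", "deep_link", ("open", "map")),
    Shortcut("set_alarm", "api", ("set", "alarm")),
]

if __name__ == "__main__":
    for action in plan(["open the map", "tap the first result", "set an alarm"], KB):
        print(action)
```

A real agent would choose shortcuts with a learned policy or a language model rather than keyword matching. The sketch only shows that every subgoal stays solvable through the GUI path while matching subgoals take the cheaper route.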