MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, Yue Wang

2025-12-23

Summary

This paper introduces a new, more difficult benchmark called MobileWorld for testing AI agents that use smartphones. It builds upon an existing benchmark, AndroidWorld, but aims to better represent the complexities of how people actually use their phones.

What's the problem?

The current standard benchmark for mobile AI agents, AndroidWorld, is becoming saturated: recent agents already exceed 90% success rates. It also leaves out common phone activities like online shopping and workplace communication, and it doesn't challenge an AI to handle vague instructions or coordinate multiple apps to complete a task the way a person would.

What's the solution?

The researchers created MobileWorld, which includes 201 tasks across 20 different apps. These tasks are longer and more complex than those in AndroidWorld, requiring nearly twice as many steps on average and often forcing the AI to switch between apps. They also added new task categories that require the AI to ask a simulated user for clarification and to call external tools through the Model Context Protocol (MCP). To keep evaluation fair and reproducible, they built a snapshot-based testing environment and automatic checks that verify task completion precisely, for example by inspecting the phone apps' backend databases.
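Checking backend data rather than the screen makes verification deterministic: the test passes only if the app's actual state changed. A minimal sketch of this idea, where the database path, table schema, and function name are all hypothetical and not MobileWorld's actual API:

```python
import sqlite3


def verify_order_placed(db_path: str, item: str, quantity: int) -> bool:
    """Hypothetical functional check for a shopping task: instead of
    parsing screenshots, inspect the app's backend database for the
    expected order row."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT quantity FROM orders WHERE item = ?", (item,)
        ).fetchone()
        # The task counts as done only if the order exists with the
        # exact quantity the instruction asked for.
        return row is not None and row[0] == quantity
    finally:
        conn.close()
```

Because the check reads the app's ground-truth state, it stays reliable even when the agent reaches the goal through an unexpected sequence of UI actions.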

Why it matters?

MobileWorld provides a far more realistic and challenging test for AI agents designed to use smartphones. Even the best systems struggle: the top agentic framework succeeds on only 51.7% of tasks and the best end-to-end model on just 20.9%, with tasks involving user interaction and MCP tool calls proving especially hard. These gaps point the way for future research to create smarter and more helpful mobile AI.

Abstract

Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) compared to AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide a snapshot-based container environment and precise functional verifications, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.
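The extended action space the abstract describes (ordinary GUI actions plus asking the simulated user and invoking MCP tools) can be pictured as a small planner-executor loop. This is a rough sketch under assumed names; the paper's actual interfaces and action formats are not reproduced here:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    # Hypothetical action kinds mirroring the extended action space:
    # "gui" (screen manipulation), "ask_user" (query the simulated
    # user), "mcp_call" (invoke an external MCP tool), "done".
    kind: str
    payload: dict


def run_episode(plan: Callable[[str], Action],
                execute_gui: Callable[[dict], str],
                ask_user: Callable[[dict], str],
                call_mcp: Callable[[dict], str],
                observation: str,
                max_steps: int = 30) -> str:
    """Planner-executor loop: the planner chooses the next action from
    the current observation; the executor dispatches on its kind and
    returns a new observation."""
    for _ in range(max_steps):
        action = plan(observation)
        if action.kind == "done":
            break
        handler = {"gui": execute_gui,
                   "ask_user": ask_user,
                   "mcp_call": call_mcp}[action.kind]
        observation = handler(action.payload)
    return observation
```

Separating planning from execution this way lets the same planner drive GUI steps, user clarifications, and tool calls through one uniform interface.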