MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover
2025-12-17
Summary
This paper explores a new way to help computer agents learn to interact with apps on phones, like completing tasks in a mobile game or using a productivity app.
What's the problem?
Currently, many AI systems trying to understand and predict what will happen when an agent interacts with a phone app focus on predicting what the screen will *look* like next. This is really hard because phone screens are complex and have lots of visual details. It's difficult for the AI to accurately predict all those pixels, especially when dealing with things like buttons and text.
What's the solution?
Instead of predicting pixels, the researchers used a different approach. They had the AI learn to *describe* what happens when an action is taken, using natural language. For example, instead of predicting the new screen, the AI would predict 'The button was pressed, and the screen changed to the next level.' They created a large dataset called MobileWorld with 1.4 million examples to train these AI systems, and a benchmark called MobileWorldBench to test them. They then built a system that uses these language-based predictions to help the agent plan its actions more effectively.
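The idea of a semantic world model can be sketched in a few lines of code. The snippet below is a toy illustration, not the paper's actual implementation: all function names, the prompt format, and the keyword-matching "planner" are assumptions made for clarity. It shows the core loop the paper describes, where a model predicts the *outcome* of an action in natural language, and a planner uses those predictions to choose among candidate actions.

```python
# Hypothetical sketch of a semantic world model for a GUI agent.
# Given a screen state and a candidate action, the model predicts the
# resulting transition in natural language instead of pixels.
# Names and prompt format are illustrative assumptions, not the paper's API.

def build_world_model_prompt(state: str, action: str) -> str:
    """Format a query for a VLM acting as a semantic world model."""
    return (
        "You are a world model for a mobile GUI agent.\n"
        f"Current screen: {state}\n"
        f"Proposed action: {action}\n"
        "Describe the resulting screen state in one sentence."
    )

def plan_with_world_model(state: str, actions: list, predict, goal: str) -> str:
    """Toy planner: prefer the action whose predicted outcome mentions the goal.

    `predict` stands in for a call to a VLM; a real system would score
    outcomes with a value function rather than keyword matching.
    """
    return max(
        actions,
        key=lambda a: goal.lower() in predict(build_world_model_prompt(state, a)).lower(),
    )
```

Used with a stubbed-out `predict`, the planner picks the action whose predicted next state matches the goal, without ever rendering a single pixel of the future screen.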
Why it matters?
This work is important because it shows that AI agents can learn to interact with apps more reliably by focusing on *what* happens, rather than *how* it looks. This makes it easier to build AI that can automate tasks on our phones and potentially help us with everyday activities. It also opens up possibilities for creating more robust and adaptable AI systems that aren't as easily fooled by changes in visual appearance.
Abstract
World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel-space world models, these approaches face practical limitations in GUI settings, where predicting complex visual elements in future states is often difficult. In this work, we explore an alternative formulation of world modeling for GUI agents, where state transitions are described in natural language rather than predicted as raw pixels. First, we introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models (VLMs) to function as world models for mobile GUI agents. Second, we release MobileWorld, a large-scale dataset consisting of 1.4M samples that significantly improves the world modeling capabilities of VLMs. Finally, we propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates. The code and dataset are available at https://github.com/jacklishufan/MobileWorld.