DigiData: Training and Evaluating General-Purpose Mobile Control Agents
Yuxuan Sun, Manchen Wang, Shengyi Qian, William R. Wong, Eric Gan, Pierluca D'Oro, Alejandro Castillejo Munoz, Sneha Silwal, Pedro Matias, Nitin Kamra, Satwik Kottur, Nick Raines, Xuanyi Zhao, Joy Chen, Joseph Greer, Andrea Madotto, Allen Bolourchi, James Valori, Kevin Carlberg, Karl Ridgeway, Joseph Tighe
2025-11-11
Summary
This paper introduces a new dataset and evaluation methods for building AI agents that can control smartphones and other mobile devices, aiming to make interacting with these devices more natural and efficient.
What's the problem?
Creating AI agents that can effectively use mobile apps is difficult for two reasons: there aren't enough high-quality datasets to train them on, and the ways we test these agents aren't very reliable. Existing datasets often lack clear goals or don't cover the full range of what apps can do, and the standard step-accuracy metric, which checks whether each individual action matches a reference demonstration, doesn't tell us whether the agent actually accomplished the goal.
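To see why step accuracy can mislead, note that many goals can be reached through more than one sequence of actions. The minimal Python sketch below (a hypothetical `Action` structure and made-up demonstration data, not the paper's code) shows how an agent can score zero step accuracy against a reference demonstration while still plausibly reaching the goal through a different UI path.

```python
# Minimal sketch contrasting per-step accuracy with end-to-end success.
# The Action structure and trajectories are illustrative assumptions,
# not DigiData's actual schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    kind: str        # e.g. "tap", "type", "scroll"
    target: str      # e.g. a UI element identifier
    text: str = ""   # payload for "type" actions

def step_accuracy(predicted: list[Action], reference: list[Action]) -> float:
    """Fraction of agent actions that exactly match a reference demonstration."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / max(len(reference), 1)

# Two valid trajectories can reach the same goal via different UI paths,
# so low step accuracy does not necessarily mean the task failed:
reference = [Action("tap", "search_bar"),
             Action("type", "search_bar", "wifi"),
             Action("tap", "result_0")]
predicted = [Action("tap", "settings_icon"),
             Action("tap", "wifi_menu"),
             Action("tap", "wifi_toggle")]

print(step_accuracy(predicted, reference))  # 0.0, yet the goal may be achieved
```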
What's the solution?
The researchers created DigiData, a large and detailed dataset specifically for training mobile control agents. It was built by systematically exploring the features of many different apps to produce diverse and complex goals for agents to achieve. They also developed DigiData-Bench, a benchmark that evaluates agents with dynamic, end-to-end protocols and AI-assisted judging of task success, rather than relying solely on step-by-step comparison against reference demonstrations; a sketch of that judging pattern follows below.
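The paper's exact judge prompts and protocol aren't reproduced here; the sketch below only illustrates the general LLM-as-judge pattern that AI-powered evaluation refers to. The `query_llm` stub is a hypothetical stand-in for any chat-completion API, and the prompt format is an assumption, not DigiData-Bench's actual template.

```python
# Hedged sketch of AI-assisted trajectory evaluation (LLM-as-judge).
# `query_llm` is a placeholder for whatever model API you use; the prompt
# below is illustrative, not the benchmark's actual protocol.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def judge_trajectory(goal: str, observations: list[str]) -> bool:
    """Ask a judge model whether the observed trajectory achieved the goal."""
    transcript = "\n".join(
        f"step {i}: {obs}" for i, obs in enumerate(observations)
    )
    prompt = (
        "You are evaluating a mobile control agent.\n"
        f"Goal: {goal}\n"
        f"Screen descriptions along the trajectory:\n{transcript}\n"
        "Answer YES if the goal was achieved, otherwise NO."
    )
    return query_llm(prompt).strip().upper().startswith("YES")
```

The advantage of judging the trajectory as a whole is that any action sequence reaching the goal counts as a success, which is exactly what per-step matching against a single reference demonstration cannot capture.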
Why does it matter?
This work is important because it provides the tools needed to build more capable and user-friendly AI agents for mobile devices. Better agents could automate tasks, assist people with disabilities, or simply make everyday smartphone use easier and more intuitive, ultimately improving how we interact with technology.
Abstract
AI agents capable of controlling user interfaces have the potential to transform human interaction with digital devices. To accelerate this transformation, two fundamental building blocks are essential: high-quality datasets that enable agents to achieve complex and human-relevant goals, and robust evaluation methods that allow researchers and practitioners to rapidly enhance agent performance. In this paper, we introduce DigiData, a large-scale, high-quality, diverse, multi-modal dataset designed for training mobile control agents. Unlike existing datasets, which derive goals from unstructured interactions, DigiData is meticulously constructed through comprehensive exploration of app features, resulting in greater diversity and higher goal complexity. Additionally, we present DigiData-Bench, a benchmark for evaluating mobile control agents on real-world complex tasks. We demonstrate that the commonly used step-accuracy metric falls short in reliably assessing mobile control agents and, to address this, we propose dynamic evaluation protocols and AI-powered evaluations as rigorous alternatives for agent assessment. Our contributions aim to significantly advance the development of mobile control agents, paving the way for more intuitive and effective human-device interactions.