AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong

2024-11-05

Summary

This paper introduces AndroidLab, a framework for systematically training and evaluating Android autonomous agents. It provides a structured operation environment in which agents built on both language models and multimodal models can learn and be benchmarked on the same tasks.

What's the problem?

While autonomous agents are becoming more important for interacting with technology, existing approaches to training and evaluating Android agents are not systematic. Most studies cover either open-source or closed-source models, but rarely both, and they rarely share a common environment or task set. This lack of structure makes it difficult to compare how well different agents perform across tasks and environments.

What's the solution?

AndroidLab addresses these issues with a systematic framework that bundles the tools and environments needed to train and evaluate Android agents. It features predefined Android virtual devices and 138 tasks across nine applications built on those devices. The framework supports both large language models (LLMs) and multimodal models (LMMs) in the same action space, so text-only and screenshot-based agents can be compared directly. Using this environment, the authors built an Android Instruction dataset and trained several open-source models, lifting average task success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs.
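To make the idea of a shared action space concrete, here is a minimal, hypothetical sketch of how an environment like this might drive a virtual device. The `AndroidDevice` class and its method names are illustrative assumptions, not the actual AndroidLab API; only the underlying `adb shell input` and `screencap` commands are standard Android tooling.

```python
# Hypothetical sketch of a unified Android action space driven through adb.
# AndroidDevice, tap, swipe, type_text are illustrative names, NOT the real
# AndroidLab interface; the adb commands themselves are standard tooling.
import subprocess

class AndroidDevice:
    """Wraps one Android virtual device reachable via adb."""

    def __init__(self, serial: str):
        self.serial = serial  # e.g. "emulator-5554"

    def _adb(self, *args: str) -> None:
        subprocess.run(["adb", "-s", self.serial, *args], check=True)

    def tap(self, x: int, y: int) -> None:
        # Tap at absolute screen coordinates.
        self._adb("shell", "input", "tap", str(x), str(y))

    def swipe(self, x1: int, y1: int, x2: int, y2: int, ms: int = 300) -> None:
        # Swipe from (x1, y1) to (x2, y2) over `ms` milliseconds.
        self._adb("shell", "input", "swipe",
                  str(x1), str(y1), str(x2), str(y2), str(ms))

    def type_text(self, text: str) -> None:
        # Type into the focused field; adb's `input text` expects %s for spaces.
        self._adb("shell", "input", "text", text.replace(" ", "%s"))

    def screenshot(self, path: str) -> None:
        # Capture the screen, e.g. as the observation for a multimodal agent.
        png = subprocess.run(
            ["adb", "-s", self.serial, "exec-out", "screencap", "-p"],
            check=True, capture_output=True,
        ).stdout
        with open(path, "wb") as f:
            f.write(png)
```

Under a sketch like this, both a text-only agent (reading the UI tree) and a multimodal agent (reading screenshots) would emit the same `tap`/`swipe`/`type_text` calls, which is what makes evaluating them in one shared action space possible.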

Why it matters?

This research is important because it enhances the capabilities of Android autonomous agents, making them more effective at performing real-world tasks. By providing a structured way to train and evaluate these agents, AndroidLab can lead to better technology that interacts with users more intelligently, ultimately benefiting areas like customer service, personal assistants, and smart home systems.

Abstract

Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have recently become a frequently mentioned interaction method. However, existing studies on training and evaluating Android agents lack systematic research covering both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities, an action space, and a reproducible benchmark. It supports both large language models (LLMs) and multimodal models (LMMs) in the same action space. The AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. Using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.
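For intuition, the average success rates reported above correspond to an evaluation loop like the one sketched here. This is a hedged sketch under assumed names: `env.reset`, `env.step`, `agent.act`, and `task.check_success` are placeholders, not the published AndroidLab interface.

```python
# Hypothetical benchmark loop computing the average success rate over tasks.
# All interface names here are illustrative stand-ins.
def evaluate(agent, env, tasks) -> float:
    """Run each task once and return the fraction of tasks completed."""
    successes = 0
    for task in tasks:
        observation = env.reset(task)            # fresh virtual-device state
        for _ in range(task.max_steps):          # bounded episode length
            action = agent.act(observation)      # LLM/LMM chooses an action
            observation, done = env.step(action)
            if done:
                break
        successes += int(task.check_success(env))  # task-specific checker
    return successes / len(tasks)                  # e.g. 0.2150 for 21.50%
```

Because the virtual devices and tasks are predefined, a loop of this shape can be rerun deterministically, which is what makes the benchmark reproducible across models.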