MobA: A Two-Level Agent System for Efficient Mobile Task Automation

Zichen Zhu, Hao Tang, Yansi Li, Kunyao Lan, Yixuan Jiang, Hao Zhou, Yixiao Wang, Situo Zhang, Liangtai Sun, Lu Chen, Kai Yu

2024-10-18

Summary

This paper introduces MobA, a two-level agent system that improves mobile task automation by using multimodal large language models (MLLMs) for better comprehension and decision-making.

What's the problem?

Current mobile assistants often struggle to understand complex user commands and to navigate varied app interfaces. Because they rely mainly on system APIs and have limited comprehension abilities, they handle diverse instructions poorly and fail at tasks users expect them to manage easily.

What's the solution?

To address these issues, the authors developed MobA, which features a two-level agent architecture. The high-level Global Agent (GA) interprets user commands, maintains a memory of past interactions, and plans tasks as sequences of sub-tasks. The low-level Local Agent (LA) takes these sub-tasks and predicts the specific actions, expressed as function calls, needed to complete them. MobA also includes a Reflection Module that lets it learn from previous experience, so it can efficiently tackle complex tasks it has not encountered before. Together, these components allow MobA to complete tasks more effectively and efficiently.
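The division of labor described above (GA plans sub-tasks, LA emits function calls, a Reflection Module records outcomes into shared memory) can be sketched roughly as follows. This is an illustrative Python sketch, not the paper's implementation: all class names, method names, and the naive planning/acting logic are assumptions standing in for MLLM calls.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Shared memory the GA tracks and the LA/Reflection Module consult."""
    history: list = field(default_factory=list)

    def record(self, entry):
        self.history.append(entry)

class GlobalAgent:
    """High-level agent: interprets the command and plans sub-tasks.
    A real system would query an MLLM; here we split the command naively."""
    def plan(self, command, memory):
        return [s.strip() for s in command.split(" then ")]

class LocalAgent:
    """Low-level agent: turns a sub-task into a concrete function call,
    guided by the sub-task and the shared memory."""
    def act(self, subtask, memory):
        # Placeholder action format; the paper's actual call schema differs.
        return {"call": "tap", "target": subtask}

class ReflectionModule:
    """Checks each outcome and stores it so later tasks can reuse it."""
    def reflect(self, subtask, action, memory):
        memory.record((subtask, action, "success"))

def run_task(command):
    memory = Memory()
    ga, la, rm = GlobalAgent(), LocalAgent(), ReflectionModule()
    for subtask in ga.plan(command, memory):
        action = la.act(subtask, memory)
        rm.reflect(subtask, action, memory)
    return memory.history

# Example: a command with two sub-tasks, each acted on and reflected into memory.
trace = run_task("open settings then enable wifi")
```

The key design point the sketch captures is the separation of concerns: the GA never emits low-level actions, and the LA never plans, which keeps each model's context focused.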

Why it matters?

This research is important because it enhances the capabilities of mobile assistants, making them more useful for users. By improving how these systems understand and execute tasks, MobA can lead to a better user experience in mobile applications, helping people manage their daily activities more smoothly and effectively.

Abstract

Current mobile assistants are limited by dependence on system APIs or struggle with complex user instructions and diverse interfaces due to restricted comprehension and decision-making abilities. To address these challenges, we propose MobA, a novel Mobile phone Agent powered by multimodal large language models that enhances comprehension and planning capabilities through a sophisticated two-level agent architecture. The high-level Global Agent (GA) is responsible for understanding user commands, tracking history memories, and planning tasks. The low-level Local Agent (LA) predicts detailed actions in the form of function calls, guided by sub-tasks and memory from the GA. Integrating a Reflection Module allows for efficient task completion and enables the system to handle previously unseen complex tasks. MobA demonstrates significant improvements in task execution efficiency and completion rate in real-life evaluations, underscoring the potential of MLLM-empowered mobile assistants.