
Lightweight Neural App Control

Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao

2024-10-24


Summary

This paper introduces a new system called Lightweight Multi-modal App Control (LiMAC) that helps mobile phones interact more efficiently with various apps by using intelligent agents.

What's the problem?

Mobile app agents typically rely on large foundation models to decide what to do next, but these models are computationally heavy and slow to respond, which is a poor fit for the limited resources of a smartphone. This can lead to frustration for users who want their phones to carry out commands quickly and accurately.

What's the solution?

LiMAC uses a combination of a small Action Transformer (AcT) and a fine-tuned vision-language model (VLM) to process user commands. It takes as input a textual goal and a sequence of past observations of the phone's screen (such as screenshots and their UI trees) and decides what action to take next; a sketch of this flow appears below. By keeping the core decision model small, LiMAC can make decisions quickly and execute tasks in real time, significantly improving the speed and accuracy of app interactions.
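As a rough illustration of how such a two-stage pipeline might be wired together, here is a minimal Python sketch. The class and function names (ActionTransformer, VisionLanguageModel, decide_next_action) and the action-type labels are hypothetical, not taken from the paper; the sketch only mirrors the high-level flow described above, in which a small transformer chooses what kind of action to take and the heavier VLM is consulted only when free-form text is needed.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes   # raw screen capture from the device
    ui_tree: dict       # parsed UI hierarchy for the current screen

# Hypothetical stand-ins for the paper's AcT and fine-tuned VLM.
class ActionTransformer:
    def predict_action_type(self, goal: str, history: list[Observation]) -> str:
        # Stub: a real implementation would run the small transformer here.
        return "click"

    def predict_click_target(self, goal: str, history: list[Observation]) -> str:
        # Stub: a real implementation would score elements of the UI tree here.
        return "element_0"

class VisionLanguageModel:
    def generate_text(self, goal: str, history: list[Observation]) -> str:
        # Stub: a real implementation would query the fine-tuned VLM here.
        return "Hello!"

def decide_next_action(goal: str, history: list[Observation],
                       act: ActionTransformer, vlm: VisionLanguageModel) -> dict:
    """Illustrative decision step: the lightweight AcT decides *what* to do,
    and the VLM is only invoked when natural-language content is required."""
    action_type = act.predict_action_type(goal, history)

    if action_type == "click":
        target = act.predict_click_target(goal, history)
        return {"type": "click", "target": target}
    if action_type == "input-text":
        text = vlm.generate_text(goal, history)
        return {"type": "input-text", "text": text}
    return {"type": action_type}

if __name__ == "__main__":
    obs = Observation(screenshot=b"", ui_tree={"root": []})
    action = decide_next_action("Send a greeting in the messaging app",
                                [obs], ActionTransformer(), VisionLanguageModel())
    print(action)
```

The intended design point is that the expensive VLM call sits behind a cheap routing decision, so most steps can be handled by the small model alone.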

Why it matters?

This research is important because it enhances how we control mobile apps, making smartphones smarter and more responsive. With better app control, users can enjoy a smoother experience when using their devices, which can lead to increased productivity and satisfaction.

Abstract

This paper introduces a novel mobile phone control architecture, termed "app agents", for efficient interactions and controls across various Android apps. The proposed Lightweight Multi-modal App Control (LiMAC) takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, within LiMAC, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.