LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
Guangyi Liu, Pengxiang Zhao, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, Wenhao Wang, Tianze Wu, Linghao Li, Hao Wang, Guanjing Xiong, Yong Liu, Hongsheng Li
2025-04-29
Summary
This survey examines how large language models (LLMs) are now being used to control and automate tasks on phone screens, making these systems smarter and able to follow more complicated instructions than older, script-based automation.
What's the problem?
Traditional phone automation systems are limited: they rely on hand-written scripts, break when apps change, and often fail to grasp what the user actually wants when instructions are vague or complicated.
What's the solution?
The researchers reviewed how new LLM-powered agents understand both language and on-screen images, make better decisions, and adapt across different apps and tasks. These agents can infer what a user wants directly from written or spoken instructions, and they keep working even when apps update or change their layout.
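The core idea behind these agents is a perceive-decide-act loop: observe the current screen, ask a model for the next action, execute it, and repeat. The sketch below illustrates that loop in miniature; all names (`Observation`, `Action`, `decide_action`) are hypothetical, and the string-matching "model" is a stand-in for what would really be a prompt to a multimodal LLM with a screenshot attached.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """What the agent perceives: the user's goal plus the current screen."""
    instruction: str            # natural-language goal from the user
    screen_elements: list[str]  # e.g. UI labels from a screenshot or view tree

@dataclass
class Action:
    kind: str    # "tap" or "done"
    target: str  # which UI element to act on, if any

def decide_action(obs: Observation) -> Action:
    """Stand-in for the LLM call: map instruction + screen to the next action.
    A real agent would prompt a multimodal model here instead of substring matching."""
    for element in obs.screen_elements:
        if element.lower() in obs.instruction.lower():
            return Action(kind="tap", target=element)
    return Action(kind="done", target="")

def run_agent(instruction: str, screens: list[list[str]]) -> list[Action]:
    """Step through successive screens until the (stub) model signals completion."""
    trace = []
    for elements in screens:
        action = decide_action(Observation(instruction, elements))
        trace.append(action)
        if action.kind == "done":
            break
    return trace

trace = run_agent(
    "Open Settings and turn on Wi-Fi",
    [["Settings", "Photos"], ["Wi-Fi", "Bluetooth"], ["Back"]],
)
# trace: tap "Settings", tap "Wi-Fi", then done
```

The key property the survey highlights is that the decision step is learned, not scripted: because the model interprets the screen and the instruction at each step, the same loop tolerates apps that update or rearrange their UI, which is exactly where script-based automation breaks.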
Why does it matter?
This matters because it means phone automation can become much more flexible, powerful, and user-friendly, helping people save time and effort on everyday tasks, and making technology more accessible for everyone.
Abstract
LLM-driven phone GUI agents have evolved from script-based automation into intelligent systems, using advanced language understanding, multimodal perception, and robust decision-making to address the challenges of generality, maintenance overhead, and intent comprehension.