GUI Agents: A Survey

Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon

2024-12-19

Summary

This paper surveys GUI agents: AI systems that interact with software applications through graphical user interfaces (GUIs), mimicking how humans use computers.

What's the problem?

As software becomes more pervasive, there is a growing need to automate interactions between humans and computers. Operating an application by hand, clicking buttons and typing commands one at a time, can be time-consuming and inefficient. The challenge is to build AI agents that can understand and navigate GUIs as a human would.

What's the solution?

The authors provide a comprehensive survey of GUI agents powered by large foundation models. They categorize different aspects of these agents, including their benchmarks, evaluation metrics, architectures, and training methods. The paper proposes a unified framework that outlines how these agents perceive their environment, reason about tasks, plan actions, and execute them. It also discusses the challenges these agents face and suggests future directions for research.
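To make the perceive, reason, plan, and act stages concrete, here is a minimal sketch of how such a loop could be wired together. It is not the framework from the paper: the toy GUI, the class and function names (ToyGUI, Element, plan, run_agent), and the rule-based planner are all hypothetical, and a real agent would replace the planner with a call to a foundation model.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One widget on the hypothetical screen."""
    kind: str          # "textbox" or "button"
    text: str = ""


@dataclass
class ToyGUI:
    """Hypothetical stand-in for a real GUI: named elements plus an action log."""
    elements: dict
    log: list = field(default_factory=list)

    def observe(self):
        # Perception: return a structured snapshot of what is on screen.
        return {name: (e.kind, e.text) for name, e in self.elements.items()}

    def type_text(self, name, text):
        self.elements[name].text = text
        self.log.append(("type", name, text))

    def click(self, name):
        self.log.append(("click", name))


def plan(goal, observation):
    """Reasoning and planning: turn a goal into an ordered list of actions.

    Assumption: a simple rule replaces the foundation model here.
    """
    actions = []
    for name, value in goal["fill"].items():
        kind, text = observation[name]
        if kind == "textbox" and text != value:
            actions.append(("type", name, value))
    actions.append(("click", goal["submit"]))
    return actions


def run_agent(gui, goal):
    observation = gui.observe()            # perceive
    actions = plan(goal, observation)      # reason and plan
    for action in actions:                 # act
        if action[0] == "type":
            gui.type_text(action[1], action[2])
        else:
            gui.click(action[1])
    return gui.log


gui = ToyGUI({"query": Element("textbox"), "search": Element("button")})
log = run_agent(gui, {"fill": {"query": "GUI agents"}, "submit": "search"})
print(log)
```

Keeping the stages as separate functions mirrors the survey's decomposition: the perception or planning component can be swapped out without touching the code that executes actions.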

Why does it matter?

This research is important because it helps improve how we interact with technology. By developing more effective GUI agents, we can automate repetitive tasks, enhance user experiences, and increase productivity across various industries. Understanding the capabilities and limitations of these agents is crucial for advancing AI technology in everyday applications.

Abstract

Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.
