Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
2024-12-06

Summary
This paper introduces Aguvis, a system that enables computer agents to interact with graphical user interfaces (GUIs) using only visual information, making it easier to automate tasks across different platforms.
What's the problem?
Automating tasks in GUIs is difficult because these interfaces can be complex and vary widely between applications. Most existing methods rely on text descriptions of the GUI elements, which can limit how well the system understands and generalizes to different environments.
What's the solution?
Aguvis addresses this with a purely vision-based approach: it learns to recognize and interact with visual elements directly from screenshots rather than relying on text representations. Training proceeds in two stages: the model first learns general GUI grounding (locating the on-screen element a natural-language instruction refers to), and then learns planning and reasoning to carry out multi-step tasks. Because it operates on pixels with a single, consistent action space, Aguvis works across platforms without separate training for each one.
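To make this concrete, below is a minimal sketch of the observe-act loop such a pure-vision agent runs: capture a screenshot, ask the model for the next action, and execute it with standard mouse and keyboard control. The `VisionAgentModel` class, its `predict_action` method, and the small action set shown are hypothetical placeholders rather than the paper's exact interface; only the `pyautogui` screenshot and input-control calls are real APIs.

```python
# Minimal sketch of a pure-vision GUI agent loop (illustrative only).
# `VisionAgentModel` and `predict_action` are hypothetical stand-ins for the
# trained model; pyautogui provides real screenshot and input primitives.
from dataclasses import dataclass

import pyautogui  # real library: screenshots plus mouse/keyboard control


@dataclass
class Action:
    kind: str          # "click", "type", or "stop"
    x: float = 0.0     # normalized [0, 1] screen coordinates
    y: float = 0.0
    text: str = ""


class VisionAgentModel:
    """Hypothetical wrapper around a screenshot-conditioned policy."""

    def predict_action(self, screenshot, instruction: str) -> Action:
        # A real system would run the vision-language model on the raw
        # screenshot plus the task instruction and decode the next action.
        raise NotImplementedError


def run_episode(model: VisionAgentModel, instruction: str, max_steps: int = 20) -> None:
    width, height = pyautogui.size()
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()              # pure-vision observation
        action = model.predict_action(screenshot, instruction)
        if action.kind == "stop":
            break
        if action.kind == "click":
            pyautogui.click(x=int(action.x * width), y=int(action.y * height))
        elif action.kind == "type":
            pyautogui.write(action.text, interval=0.02)
```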
Why it matters?
This research is significant because it represents a major step toward creating fully autonomous systems that can navigate and use software just like humans do. By focusing on visual understanding instead of text, Aguvis can adapt to different GUIs more easily and could lead to advancements in areas like accessibility tools and automated software testing. The authors have also made their data and models available for others to use and improve upon.
Abstract
Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representations of GUIs, which introduce limitations in generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. Our approach leverages image-based observations, grounds natural-language instructions to visual elements, and employs a consistent action space to ensure cross-platform generalization. To address the limitations of previous work, we integrate explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. We construct a large-scale dataset of GUI agent trajectories, incorporating multimodal reasoning and grounding, and employ a two-stage training pipeline that first focuses on general GUI grounding, followed by planning and reasoning. Through comprehensive experiments, we demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios. To our knowledge, Aguvis is the first fully autonomous pure vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. We open-source all datasets, models, and training recipes to facilitate future research at https://aguvis-project.github.io/.
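As a rough illustration of the two-stage pipeline described in the abstract, the sketch below shows how grounding examples and planning/reasoning examples might be serialized into (screenshot, prompt, target) triples. The field names, file paths, and the action and "Thought" serialization are assumptions made for illustration, not the paper's released data format.

```python
# Illustrative construction of training examples for the two stages described
# above. Field names and the exact serialization are assumptions, not the
# released recipe.
from dataclasses import dataclass


@dataclass
class TrainingExample:
    screenshot_path: str   # image observation
    prompt: str            # model input text
    target: str            # supervised output


# Stage 1: GUI grounding -- learn to locate the element an instruction refers to.
grounding_example = TrainingExample(
    screenshot_path="screens/settings_page.png",
    prompt="Click the 'Wi-Fi' toggle.",
    target="click(x=0.87, y=0.23)",  # normalized screen coordinates
)

# Stage 2: planning and reasoning -- pair an explicit reasoning step with the
# next action in a unified action space.
planning_example = TrainingExample(
    screenshot_path="screens/search_results.png",
    prompt=(
        "Task: find tomorrow's weather in Tokyo.\n"
        "Previous actions: click(x=0.50, y=0.08)"
    ),
    target=(
        "Thought: the search box is focused, so I should type the query.\n"
        "type(text='Tokyo weather tomorrow')"
    ),
)
```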