OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao

2024-11-04

Summary

This paper presents OS-ATLAS, a new model designed to help generalist graphical user interface (GUI) agents better understand and interact with different software environments. It focuses on improving how these agents can recognize and act on various GUI elements across multiple platforms.

What's the problem?

Many existing GUI agents depend on powerful commercial models that are not open-source, making it hard for researchers to use them. Open-source models often perform poorly, especially when dealing with unfamiliar or less common GUI layouts. This limits the ability of researchers to develop effective agents that can work in diverse environments.

What's the solution?

The authors developed OS-ATLAS, an open-source foundational model that improves GUI grounding (understanding where elements are in a GUI) and can handle out-of-distribution tasks (working with interfaces it hasn't seen before). They built an open-source toolkit for synthesizing GUI grounding data across Windows, Linux, MacOS, Android, and the web, and used it to produce a dataset of over 13 million GUI elements. This dataset, together with changes to model training, helps OS-ATLAS understand GUI screenshots and generalize to unseen interfaces.
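To make the idea of grounding concrete, here is a minimal Python sketch. It assumes a grounding example pairs an instruction with the target element's bounding box in pixel coordinates, and that a prediction counts as correct when the predicted click point lands inside that box. The `GroundingExample` type, its field names, and the `is_hit` helper are hypothetical illustrations, not the paper's data format or evaluation code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GroundingExample:
    """Hypothetical record: an instruction plus the target element's box."""
    instruction: str
    # Assumed layout: (left, top, right, bottom) in screenshot pixels.
    bbox: Tuple[float, float, float, float]


def is_hit(example: GroundingExample, x: float, y: float) -> bool:
    """Return True if the predicted point (x, y) lies inside the target box."""
    left, top, right, bottom = example.bbox
    return left <= x <= right and top <= y <= bottom


# Made-up example values, for illustration only.
example = GroundingExample("Click the Save button", (100, 40, 180, 70))
print(is_hit(example, 140, 55))  # point inside the box
print(is_hit(example, 10, 10))   # point outside the box
```

Under these assumptions, a grounding model would be scored by the fraction of examples for which its predicted point is a hit.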

Why does it matter?

This research matters because it gives researchers an open model, toolkit, and dataset for building GUI agents that can operate across mobile, desktop, and web environments. Making these resources available to the research community lowers the dependence on closed commercial models and supports further work on more versatile and capable software agents.

Abstract

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.
