OmniParser for Pure Vision Based GUI Agent
Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
2024-08-02

Summary
This paper introduces OmniParser, a method that helps vision-based AI agents understand and interact with user interfaces (UIs) by parsing screenshots into structured elements. The parsed output enhances the ability of models such as GPT-4V to carry out tasks across different operating systems and applications.
What's the problem?
Current AI models struggle to effectively interact with user interfaces because they lack a reliable way to identify clickable icons and understand the functions of different elements in a screenshot. This makes it difficult for these models to perform actions accurately, limiting their usefulness in real-world applications.
What's the solution?
To solve this problem, the authors developed OmniParser, which is built on two curated datasets: one for detecting interactable icons and another for describing their functions. These datasets are used to fine-tune a detection model that locates actionable regions in a screenshot and a captioning model that explains what each detected element does. The resulting structured representation lets GPT-4V ground the actions it proposes in specific screen regions, significantly improving its performance on several benchmarks.
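To make the two-stage idea concrete, here is a minimal sketch of an OmniParser-style parsing pipeline. The detector and captioner callables are hypothetical stand-ins for the paper's fine-tuned interactable-icon detection and icon-caption models; the names and data layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an OmniParser-style screen-parsing pipeline.
# `detector` and `captioner` are hypothetical stand-ins for the fine-tuned
# detection and captioning models described in the paper.
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

@dataclass
class ParsedElement:
    element_id: int    # numeric ID used to mark the region on the screenshot
    bbox: BBox         # location of the element on screen
    description: str   # functional caption, e.g. "settings gear icon"

def parse_screenshot(
    screenshot,
    detector: Callable[[object], Sequence[BBox]],
    captioner: Callable[[object, BBox], str],
) -> List[ParsedElement]:
    """Turn a raw UI screenshot into a structured list of labeled elements."""
    elements: List[ParsedElement] = []
    # 1) Detect candidate interactable regions (icons, buttons, links).
    for idx, bbox in enumerate(detector(screenshot)):
        # 2) Caption each region so a downstream LLM can reason about
        #    what the element does rather than about raw pixels.
        elements.append(ParsedElement(idx, bbox, captioner(screenshot, bbox)))
    return elements
```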
Why it matters?
This research is important because it improves how AI agents interact with user interfaces, making it easier for developers to build agents that operate across different applications without requiring information beyond the screenshot itself. By improving the understanding of UIs, OmniParser opens up new possibilities for automation, accessibility, and user experience in software applications.
Abstract
The recent success of large vision-language models shows great potential in driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent across multiple operating systems and applications is largely underestimated due to the lack of a robust screen-parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset from popular webpages and an icon description dataset. These datasets were used to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OmniParser with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot.
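As a rough illustration of how the parsed output could reach the language model, the sketch below serializes detected elements into the text portion of a GPT-4V prompt, to accompany a screenshot annotated with matching numeric marks. The field layout and wording are assumptions for illustration, not the paper's exact prompt format.

```python
# Sketch: serialize parsed elements into a textual prompt fragment.
# The format is an assumption, not the paper's actual prompt template.
from typing import Iterable, Tuple

Element = Tuple[int, Tuple[float, float, float, float], str]  # (id, bbox, description)

def elements_to_prompt(elements: Iterable[Element]) -> str:
    lines = ["Interactable elements detected in the screenshot:"]
    for element_id, (x1, y1, x2, y2), description in elements:
        lines.append(
            f"[{element_id}] bbox=({x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}): {description}"
        )
    lines.append("Choose the element ID to act on and the action to perform.")
    return "\n".join(lines)

# Example usage:
# print(elements_to_prompt([(0, (0.91, 0.02, 0.97, 0.08), "settings gear icon")]))
```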