VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

2024-06-17

Summary

This paper introduces VideoGUI, a new benchmark for testing how well AI systems can automate tasks in graphical user interfaces (GUIs), with tasks sourced from high-quality instructional videos. Unlike prior benchmarks built around single-sentence commands, it focuses on complex, multi-step tasks that require understanding visual information as well as text.

What's the problem?

Most existing formulations of GUI automation are limited to simple tasks that can be expressed in a single sentence, like 'Insert a new slide.' This framing doesn't capture the complexity of real-world software usage, where users often perform many steps and interact with a variety of GUI elements. As a result, AI systems struggle with more complicated tasks involving professional software like Adobe Photoshop or video editing tools.

What's the solution?

To address this issue, the authors developed VideoGUI, a benchmark built from high-quality web instructional videos that covers professional and novel software (such as Adobe Photoshop or Stable Diffusion WebUI) and complex activities like video editing. Evaluation is hierarchical, with three levels: high-level planning (reconstructing procedural subtasks from visual conditions alone, without language descriptions), middle-level planning (generating sequences of precise action narrations from the current screenshot and the goal), and atomic action execution (performing specific actions, such as accurately clicking a designated element). By scoring performance at each level separately, the researchers can pinpoint the specific level at which an AI system fails.
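To make the three-level structure concrete, here is a minimal sketch of how such a hierarchical task annotation could be represented. All class and field names here are hypothetical illustrations, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AtomicAction:
    """Lowest level: one concrete GUI action."""
    kind: str        # "click", "drag", "type", or "scroll"
    target: str      # description of the GUI element involved
    payload: str = ""  # typed text, drag endpoint, scroll amount, etc.

@dataclass
class Subtask:
    """Middle level: a narrated step realized by atomic actions."""
    narration: str
    actions: List[AtomicAction] = field(default_factory=list)

@dataclass
class GUITask:
    """High level: a full task decomposed into procedural subtasks."""
    goal: str
    subtasks: List[Subtask] = field(default_factory=list)

# A toy task expressed in this hypothetical schema:
task = GUITask(
    goal="Add a title slide with the text 'Q3 Review'",
    subtasks=[
        Subtask(
            narration="Insert a new slide",
            actions=[AtomicAction(kind="click", target="New Slide button")],
        ),
        Subtask(
            narration="Type the title text",
            actions=[
                AtomicAction(kind="click", target="title text box"),
                AtomicAction(kind="type", target="title text box",
                             payload="Q3 Review"),
            ],
        ),
    ],
)

# Comparing a model's output to ground truth at each of the three levels
# makes failures attributable to planning versus execution.
for sub in task.subtasks:
    print(sub.narration, "->", [a.kind for a in sub.actions])
```

Separating the levels this way is what lets the benchmark report, for example, that a model plans well but clicks inaccurately, rather than a single opaque success rate.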

Why it matters?

This research is important because it provides a more realistic way to evaluate how well AI can assist users in navigating complex software applications. By focusing on visual-centric tasks, VideoGUI aims to improve the effectiveness of GUI automation tools, which could lead to increased productivity and a better user experience in various fields, including design and video editing.

Abstract

Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing for identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements. For each level, we design evaluation metrics across individual dimensions to provide clear signals, such as individual performance in clicking, dragging, typing, and scrolling for atomic action execution. Our evaluation on VideoGUI reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks, especially for high-level planning.
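As a rough illustration of the per-dimension metrics the abstract mentions for atomic action execution, the sketch below computes accuracy broken down by action type. The function name and its exact-match comparison are simplifications I am assuming for illustration; the paper's actual matching rules (e.g. distance thresholds for click locations) are more involved:

```python
from collections import defaultdict

def per_dimension_accuracy(predictions, references):
    """Accuracy broken down by action type (click/drag/type/scroll).

    `predictions` and `references` are hypothetical parallel lists of
    (kind, value) pairs; exact equality stands in for the benchmark's
    real, more nuanced matching criteria.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, ref in zip(predictions, references):
        kind = ref[0]
        totals[kind] += 1
        if pred == ref:
            hits[kind] += 1
    return {kind: hits[kind] / totals[kind] for kind in totals}

print(per_dimension_accuracy(
    [("click", "OK button"), ("type", "Q3 Review"), ("scroll", "down")],
    [("click", "OK button"), ("type", "Q3 Review"), ("scroll", "up")],
))
# {'click': 1.0, 'type': 1.0, 'scroll': 0.0}
```

Reporting a separate score per action type gives the "clear signals" the abstract describes: a model can be strong at typing yet weak at scrolling, and an aggregate number would hide that.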