WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

Henry Hengyuan Zhao, Difei Gao, Mike Zheng Shou

2025-02-13

Summary

This paper talks about WorldGUI, a new benchmark for testing how well AI can handle tasks in desktop programs by simulating real-life situations where the software might not start in its usual state. It also introduces a framework called GUI-Thinker to help AI deal with these challenges.

What's the problem?

AI systems that interact with computer programs often struggle when the software isn't in its default setup, like when a program is already open or in a different state than expected. Current testing methods don't evaluate how well AI can adapt to these real-world scenarios, making it hard to measure their true capabilities.

What's the solution?

The researchers created WorldGUI, a benchmark of tasks across 10 popular desktop applications (including PowerPoint, VSCode, and Adobe Acrobat), each with varying starting conditions to mimic real user interactions. They also developed GUI-Thinker, a framework that uses a critique mechanism, a form of critical thinking, to help AI adapt to unpredictable situations and improve task planning. In their experiments, GUI-Thinker outperformed Claude-3.5 (Computer Use) by 14.9% in success rate on these dynamic tasks.
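To make the idea of "varying starting conditions" concrete, here is a minimal sketch in Python. It is not the paper's actual data format; the names GUITask, expand_with_initial_states, and success_rate are hypothetical, and the assumption is simply that each benchmark goal is paired with several initial-state variants and an agent is scored across all of them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GUITask:
    app: str            # e.g. "PowerPoint", "VSCode", "Adobe Acrobat"
    instruction: str    # the user goal, held constant across variants
    initial_state: str  # how the environment starts for this variant

def expand_with_initial_states(app, instruction, states):
    """Build one task per initial-state variant of the same goal."""
    return [GUITask(app, instruction, s) for s in states]

# One goal, three different starting conditions (hypothetical examples).
tasks = expand_with_initial_states(
    "PowerPoint",
    "Set the title font of slide 1 to 32 pt",
    [
        "app not yet launched",
        "app open on a blank presentation",
        "app open with the font dialog already visible",
    ],
)

def success_rate(results):
    """Fraction of task variants the agent completed successfully."""
    return sum(results) / len(results)
```

An agent that only handles the default starting state would pass just one of the three variants above, which is exactly the kind of brittleness this benchmark is designed to expose.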

Why it matters?

This matters because it brings us closer to creating AI systems that can reliably use computers like humans do, even in complex or unexpected situations. By testing and improving AI's ability to adapt to dynamic environments, this research could lead to smarter and more versatile tools for automating tasks on computers, benefiting industries like tech support, data entry, and beyond.

Abstract

Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state (such as the target software not being open, or the interface not being in its default state) often lead to planning errors. This issue is widespread in real user scenarios, but existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework leveraging a critique mechanism that effectively manages the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker significantly outperforms Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This improvement underscores the effectiveness of our critical-thinking-based framework in enhancing GUI automation.