VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou, Nancy Lau, Alvaro A. Cardenas, Yuhui Xu, Ran Xu, Caiming Xiong, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie

2026-04-24

Summary

This paper introduces a new system called VLAA-GUI designed to make AI agents better at using computer graphical user interfaces (GUIs), like the windows and buttons you see on your computer screen. It focuses on making these agents more reliable and preventing them from getting stuck or falsely claiming to have finished a task.

What's the problem?

When you try to get an AI to automate tasks on a computer, two big problems often occur. First, the AI might declare a task finished when it hasn't actually achieved the goal, stopping prematurely. Second, the AI can get stuck in a loop, repeatedly trying the same things that don't work and wasting time and resources. Imagine an AI that tries to open a file, fails, and then just keeps clicking the same button over and over.

What's the solution?

VLAA-GUI tackles these problems with three main parts. First, a 'Completeness Verifier' double-checks whether the AI *really* finished the task by looking for visual proof on the screen, and rejects any "done" claim that lacks it. Second, a 'Loop Breaker' steps in if the AI keeps failing or keeps returning to the same screen, forcing it to switch how it interacts or to try a different strategy. Finally, a 'Search Agent' lets the AI look up instructions online when it encounters an unfamiliar workflow. The authors also added a Coding Agent for code-heavy actions and a Grounding Agent for precisely locating where to act on screen, both called only when needed. The system was tested with five powerful AI models on two benchmarks covering Linux and Windows tasks.
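To make the verifier idea concrete, here is a minimal sketch of a gate that accepts a "done" claim only when every success criterion has on-screen evidence. It is an illustration under stated assumptions, not the paper's implementation: `FinishClaim`, `verify_completion`, and the idea of representing evidence as a set of matched criteria are all hypothetical names and simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class FinishClaim:
    """Hypothetical record of an agent's 'task done' claim."""
    task: str
    # Success criteria, phrased as things that should be visible on screen.
    criteria: list[str]
    # Criteria for which direct on-screen evidence was actually found.
    observed: set[str] = field(default_factory=set)


def verify_completion(claim: FinishClaim) -> tuple[bool, list[str]]:
    """Accept the claim only if every criterion has visual evidence.

    A claim with no criteria at all is rejected, since nothing can be
    verified (an assumption made for this sketch).
    """
    missing = [c for c in claim.criteria if c not in claim.observed]
    accepted = bool(claim.criteria) and not missing
    return accepted, missing


claim = FinishClaim(
    task="Save the document as report.odt",
    criteria=["title bar shows report.odt", "save dialog is closed"],
    observed={"save dialog is closed"},
)
accepted, missing = verify_completion(claim)
# The claim is rejected because one criterion has no visual evidence,
# so the agent would keep working instead of stopping early.
```

In this sketch a rejected claim returns the unmet criteria, which an agent loop could feed back into the next step rather than ending the episode.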

Why does it matter?

This research is important because it makes AI agents more capable of automating real-world computer tasks. By preventing premature stopping and endless loops, VLAA-GUI helps an AI complete complex actions reliably; with three of the five tested models it even exceeds the reported human score (72.4%) on the OSWorld benchmark in a single pass. This could lead to more efficient automation in many areas, from customer service to data analysis.

Abstract

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
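The tiered behavior of the Loop Breaker can be sketched roughly as follows. This is a simplified illustration, not the authors' code: the `LoopBreaker` class, the thresholds of 3, the use of a screen hash to detect state recurrence, and the returned signal strings are all assumptions introduced here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LoopBreaker:
    """Hypothetical tiered loop detector for a GUI agent."""
    max_action_failures: int = 3  # assumed threshold
    max_state_repeats: int = 3    # assumed threshold
    failures: Counter = field(default_factory=Counter)
    state_seen: Counter = field(default_factory=Counter)

    def observe(self, action: str, succeeded: bool, screen_hash: str,
                reflection_flags_loop: bool = False) -> str:
        """Record one step and return a signal for the next step."""
        if succeeded:
            self.failures[action] = 0
        else:
            self.failures[action] += 1
        self.state_seen[screen_hash] += 1

        # Reflection signal or a recurring screen state: force a new strategy.
        if (reflection_flags_loop
                or self.state_seen[screen_hash] >= self.max_state_repeats):
            return "change_strategy"
        # Same action failing repeatedly: switch interaction mode
        # (for example, from mouse clicks to keyboard shortcuts).
        if self.failures[action] >= self.max_action_failures:
            return "switch_interaction_mode"
        return "continue"


breaker = LoopBreaker()
signals = [
    breaker.observe("click(Save)", succeeded=False, screen_hash=f"s{i}")
    for i in range(3)
]
# signals == ["continue", "continue", "switch_interaction_mode"]
```

The ordering of the checks is a design choice for this sketch: a recurring screen state is treated as the stronger signal, so it overrides the milder mode switch.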
