GTA1: GUI Test-time Scaling Agent

Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Caiming Xiong, Junnan Li

2025-07-09

Summary

This paper introduces GTA1, an AI agent that interacts with Graphical User Interfaces (GUIs) more accurately by combining test-time scaling with reinforcement learning. The agent breaks a user instruction down into actions, samples several candidate actions at each step and selects the best one, and is trained to click or type on the screen more precisely.

What's the problem?

When an AI agent follows instructions on a complex computer screen, two things are hard. First, many action sequences can look plausible, so choosing the one that actually reaches the goal is difficult. Second, the agent must interact accurately with small or crowded visual elements.

What's the solution?

The researchers created GTA1, which samples many candidate actions at each step and uses a judge model to pick the best one. Separately, reinforcement learning trains the agent to click or act on the right part of the screen. Together, these let the agent avoid bad decisions early on and complete tasks more reliably.
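The sample-then-judge step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Action` fields, the `select_action` helper, and the toy `toy_propose` and `toy_judge` functions are all hypothetical stand-ins. In a real agent, the proposer would be a planner model sampled several times and the judge would be a separate model that compares the candidates.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Action:
    """Hypothetical action proposal; not the paper's actual schema."""
    kind: str       # e.g. "click" or "type"
    target: str     # description of the UI element to act on
    text: str = ""  # text to enter, used by "type" actions


def select_action(
    propose: Callable[[str, int], Sequence[Action]],
    judge: Callable[[str, Sequence[Action]], int],
    instruction: str,
    num_samples: int = 8,
) -> Action:
    """Sample several candidate actions and return the one the judge picks."""
    candidates = list(propose(instruction, num_samples))
    if not candidates:
        raise ValueError("the proposer returned no candidate actions")
    # Drop duplicate proposals while keeping first-seen order.
    unique = list(dict.fromkeys(candidates))
    best = judge(instruction, unique)
    if not 0 <= best < len(unique):
        raise IndexError("the judge returned an out-of-range index")
    return unique[best]


def toy_propose(instruction: str, n: int) -> List[Action]:
    """Stand-in for sampling a planner model n times (assumption)."""
    pool = [
        Action("click", "File menu"),
        Action("click", "Settings menu"),
        Action("type", "search box", "settings"),
    ]
    return [pool[i % len(pool)] for i in range(n)]


def toy_judge(instruction: str, candidates: Sequence[Action]) -> int:
    """Stand-in for a judge model: prefer targets sharing words with the instruction."""
    words = set(instruction.lower().split())
    scores = [len(words & set(c.target.lower().split())) for c in candidates]
    return scores.index(max(scores))


chosen = select_action(toy_propose, toy_judge, "Open the Settings menu")
print(chosen)
```

With these toy functions, the eight sampled proposals collapse to three distinct candidates, and the judge selects the click on the "Settings menu" target. The chosen action would then be handed to the grounding step, which decides where on the screen to click.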

Why does it matter?

This matters because it helps AI agents work better with computers and software, making them more useful for automating tasks like managing apps or helping users navigate interfaces, especially in situations where screens are complicated or unclear.

Abstract

A GUI Test-time Scaling Agent addresses task planning ambiguity and visual grounding accuracy in GUI interactions using reinforcement learning and test-time scaling.
