Computer-Use Agents as Judges for Generative User Interface
Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou
2025-11-25
Summary
This paper explores a new way to design computer interfaces, not for humans but for AI agents. It proposes using AI agents themselves to judge and improve interfaces created by other AI systems, focusing on whether an interface lets the agent *actually complete tasks* rather than on how it looks.
What's the problem?
Currently, most computer interfaces (like websites and apps) are designed with people in mind, prioritizing visual appeal and ease of use for humans. When AI agents try to use these interfaces, they have to work around designs meant for human behavior, which is inefficient. This raises the question: can we design interfaces specifically for how AI agents operate?
What's the solution?
The researchers created a testing ground called AUI-Gym with 52 different applications and used language models to generate 1560 tasks for agents to perform; a programmatic verifier checks that each task is actually executable in its environment. They then built a system where one AI (the 'Coder') designs an interface, and another AI (the 'CUA', a Computer-Use Agent) tests it by trying to complete the tasks. The CUA reports not just whether it *can* do something, but how easily. To help the Coder understand the CUA's feedback, they created a 'CUA Dashboard' that summarizes the agent's navigation steps into a simple visual report, letting the Coder revise the interface based on the agent's experience (see the sketch of this loop just below).
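To make the loop concrete, here is a minimal sketch in Python of the generate → test → summarize → revise cycle described above. Every name in it (`coder_generate`, `cua_attempt`, `dashboard`, `TaskResult`) is a hypothetical stand-in, not the paper's actual API; the stubs only mimic the shape of the collaboration.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: int
    success: bool
    steps: int  # navigation steps the agent needed

def coder_generate(spec: str, feedback: str | None = None) -> str:
    """Stub: the Coder LLM emits (or revises) website code from a spec."""
    return f"<html><!-- UI for: {spec} | feedback: {feedback} --></html>"

def cua_attempt(ui_code: str, task_id: int) -> TaskResult:
    """Stub: the CUA tries one task against the rendered UI."""
    return TaskResult(task_id=task_id, success=task_id % 2 == 0, steps=5)

def dashboard(results: list[TaskResult]) -> str:
    """Compress multi-step navigation histories into a short summary
    the Coder can read, mirroring the paper's 'CUA Dashboard' idea."""
    failed = [r.task_id for r in results if not r.success]
    rate = sum(r.success for r in results) / len(results)
    return f"success_rate={rate:.0%}, failed_tasks={failed}"

def design_loop(spec: str, task_ids: list[int], rounds: int = 3) -> str:
    ui = coder_generate(spec)
    for _ in range(rounds):
        results = [cua_attempt(ui, t) for t in task_ids]
        if all(r.success for r in results):
            break  # agent-native success: every task is solvable
        feedback = dashboard(results)  # interpretable guidance for redesign
        ui = coder_generate(spec, feedback)
    return ui

print(design_loop("todo-list app", task_ids=[1, 2, 3, 4]))
```

Note the design choice this sketch reflects: the stopping criterion is whether the agent can solve the tasks, not any visual or aesthetic metric.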
Why it matters?
This work is important because it shifts the focus of interface design from what looks good to humans to what works best for AI agents. As AI becomes more common, designing interfaces that are efficient for agents will be crucial, and this research is a step towards making that happen, allowing AI to actively shape the digital world instead of just passively using it.
Abstract
Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet most GUIs remain designed primarily for humans, prioritizing aesthetics and usability, which forces agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: can CUAs act as judges to assist Coders in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward moving agents from passive use of digital environments to active participation in them. Our code and dataset are available at https://github.com/showlab/AUI.
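The task verifier mentioned in the abstract can also be illustrated with a small sketch. Assuming each synthesized task ships with a reference solution and a programmatic success check (both assumptions made here for illustration; the real AUI-Gym verifier may work differently), verification amounts to replaying the solution in a fresh environment and testing the check.

```python
from typing import Callable

# Hypothetical types: an "environment" is any mutable app state; a task
# bundles a scripted reference solution with a programmatic success check.
Env = dict
Solution = Callable[[Env], None]
Check = Callable[[Env], bool]

def verify_task(make_env: Callable[[], Env],
                solution: Solution,
                check: Check) -> bool:
    """Keep a task only if its reference solution, executed in a fresh
    environment, satisfies the programmatic success check."""
    env = make_env()
    try:
        solution(env)
    except Exception:
        return False  # solution crashes -> task is not executable
    return check(env)

# Toy example: an 'add item to cart' task in a dict-backed shop app.
ok = verify_task(
    make_env=lambda: {"cart": []},
    solution=lambda env: env["cart"].append("book"),
    check=lambda env: "book" in env["cart"],
)
print(ok)  # True -> the task is executable and can enter the benchmark
```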