VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

Beitong Zhou, Zhexiao Huang, Yuan Guo, Zhangxuan Gu, Tianyu Xia, Zichen Luo, Fei Tang, Dehan Kong, Yanyi Shang, Suling Ou, Zhenlin Guo, Changhua Meng, Shuheng Shen

2025-12-19

Summary

This paper introduces VenusBench-GD, a new and improved way to test how well computer programs can 'understand' what's happening on a computer screen, specifically within graphical user interfaces (GUIs) like buttons, menus, and windows.

What's the problem?

Current methods for testing these 'GUI understanding' programs aren't very good. They either don't have enough examples to learn from, only work on one type of computer system, or require a lot of specialized knowledge about the specific programs being tested. This makes it hard to build programs that can reliably interact with different GUIs in the real world.

What's the solution?

The researchers created VenusBench-GD, a large collection of examples covering many different applications and platforms, which even includes data in two languages. They also developed a careful construction process to make sure the examples are accurately annotated. Importantly, they broke the 'understanding' task down into different levels of difficulty – basic things like locating a button, and more advanced things like understanding what a button *does*.
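To make the basic grounding task concrete, here is a minimal sketch of how such predictions are commonly scored: the model outputs a click point, and it counts as correct if the point lands inside the target element's annotated bounding box. The function names and sample data below are illustrative assumptions, not from the VenusBench-GD release.

```python
# Hypothetical sketch of scoring a basic GUI grounding prediction.
# A model outputs a click point (x, y); the prediction is a "hit" if the
# point falls inside the element's ground-truth bounding box.
from dataclasses import dataclass


@dataclass
class GroundingExample:
    instruction: str                         # e.g. "Click the Save button"
    bbox: tuple[int, int, int, int]          # (left, top, right, bottom) in pixels


def is_hit(pred_xy: tuple[int, int], bbox: tuple[int, int, int, int]) -> bool:
    """Return True if the predicted point lies inside the ground-truth box."""
    x, y = pred_xy
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def accuracy(preds, examples) -> float:
    """Fraction of predictions that land inside their target element."""
    hits = sum(is_hit(p, ex.bbox) for p, ex in zip(preds, examples))
    return hits / len(examples)


# Toy data: one correct prediction and one miss.
examples = [
    GroundingExample("Click the Save button", (100, 40, 160, 70)),
    GroundingExample("Open the File menu", (10, 0, 60, 25)),
]
preds = [(130, 55), (200, 200)]
print(accuracy(preds, examples))  # 0.5
```

Advanced subtasks (e.g. grounding by an element's *function* rather than its appearance) change what the instruction asks for, but the point-in-box scoring idea stays the same.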

Why it matters?

The results show that general-purpose AI models are getting quite good at the simple tasks, sometimes even outperforming programs specifically designed for GUIs. The more complex tasks, however, still favor the specialized programs – but those programs tend to memorize their training data rather than truly understand it, so they handle new situations poorly. This highlights the need for testing methods that evaluate programs across a range of tasks and check that they generalize.

Abstract

GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though they exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.