
POINTS-GUI-G: GUI-Grounding Journey

Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou

2026-02-09


Summary

This paper focuses on building computer programs, called GUI agents, that can automatically use computer interfaces – things like windows, buttons, and text boxes – to complete tasks for you, like shopping online or booking a flight.

What's the problem?

Currently, getting these agents to reliably *understand* what they're looking at on the screen is a big challenge. They need to accurately locate things like buttons and text fields before they can click or type in them. Previous approaches fine-tuned models that already had strong spatial grounding abilities (such as Qwen3-VL). This research instead set out to build that grounding ability from scratch, starting from a base model (POINTS-1.5) that had very little of it, in order to master the full technical pipeline.

What's the solution?

The researchers created a new model called POINTS-GUI-G-8B that excels at locating interface elements. They achieved this through three main improvements: first, they unified diverse open-source training datasets into a common format and refined them with augmentation, filtering, and difficulty grading; second, they improved how the model 'sees' by continuing to fine-tune its vision encoder and by keeping image resolution consistent between training and inference; and third, they applied reinforcement learning with verifiable rewards, a technique usually used to strengthen reasoning, which turned out to make the model noticeably more precise at this perception-heavy task, because it is easy and reliable to check whether a predicted screen location is correct.
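
This summary does not spell out the paper's exact reward implementation, but a common, easily verifiable reward for GUI grounding is to check whether the model's predicted click point falls inside the ground-truth element's bounding box. The sketch below illustrates that idea in Python; the function name, coordinate format, and binary 0/1 reward are assumptions for illustration, not the authors' code.

    def grounding_reward(pred_xy, gt_box):
        """Return 1.0 if the predicted click point lies inside the ground-truth
        element box, else 0.0 (a binary, automatically checkable reward)."""
        x, y = pred_xy                  # predicted click coordinates in pixels
        x1, y1, x2, y2 = gt_box         # ground-truth box: left, top, right, bottom
        return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

    # Example: a 100x40 button whose box is (200, 300, 300, 340).
    print(grounding_reward((245, 318), (200, 300, 300, 340)))  # 1.0 (hit)
    print(grounding_reward((150, 318), (200, 300, 300, 340)))  # 0.0 (miss)

Because such a reward can be computed exactly from existing annotations, no learned reward model is needed, which is the "easily verifiable and highly accurate" advantage the authors point to.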

Why it matters?

This work is important because it shows it's possible to build an effective GUI grounding model without starting from one that already has strong spatial grounding abilities. This opens the door to creating more accessible and adaptable automation tools that can help people with repetitive digital tasks, and it demonstrates a new way to use reinforcement learning to improve visual perception rather than only reasoning.

Abstract

The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
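
The abstract's second factor, resolution consistency between training and inference, can be pictured as one shared screenshot-preprocessing rule applied in both pipelines. The sketch below is illustrative only, assuming a hypothetical pixel budget and a Pillow-based resize; it is not the authors' implementation.

    from PIL import Image

    MAX_PIXELS = 1280 * 1280  # hypothetical cap, used identically at train and test time

    def preprocess_screenshot(path):
        """Load a screenshot and downscale it so the total pixel count stays under
        MAX_PIXELS, preserving aspect ratio. Reusing this one routine for both the
        training data loader and the inference server keeps the model from seeing a
        resolution distribution it was never trained on."""
        img = Image.open(path).convert("RGB")
        w, h = img.size
        if w * h > MAX_PIXELS:
            scale = (MAX_PIXELS / (w * h)) ** 0.5
            img = img.resize((int(w * scale), int(h * scale)), Image.BICUBIC)
        return img

Keeping the resize rule identical in both stages also means that element coordinates learned during training remain valid when the model is deployed.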