
Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Suyuchen Wang, Tianyu Zhang, Ahmed Masry, Christopher Pal, Spandana Gella, Bang Liu, Perouz Taslakian

2025-10-06

Summary

This paper focuses on improving how computers understand instructions to interact with graphical user interfaces (GUIs), like clicking buttons or selecting items on a screen. It's about getting computers to accurately pinpoint locations on a screen based on what we tell them to do.

What's the problem?

Currently, computers struggle with this task, especially when the screen resolution differs from what they saw during training. They try to learn the relationship between what something *looks* like and where it *is* on the screen all at once, which is hard. It's like trying to read a map without any gridlines – you have to memorize every location's exact coordinates. This leads to errors when the 'map' (the screen) changes size or layout.

What's the solution?

The researchers came up with two main ideas. First, they introduced special 'RULER' tokens that act like gridlines, giving the computer clear reference points for locations. Instead of guessing a coordinate from scratch, the computer can say 'a little to the right of this RULER marker' and only needs to predict a small adjustment. Second, they designed Interleaved MRoPE (I-MRoPE), an improved positional encoding that treats width and height equally when representing positions, so the computer isn't biased towards one dimension over the other.
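To make the gridline analogy concrete, here is a minimal sketch of the "reference a marker and adjust" idea. The function names and the marker spacing are purely illustrative, not the paper's actual implementation:

```python
# Hypothetical sketch of the RULER-token idea: instead of emitting a raw
# pixel coordinate, express it as the nearest gridline marker plus a small
# signed offset. Predicting a small offset from a known reference is easier
# than memorizing every absolute coordinate at every resolution.

def ruler_encode(x: int, spacing: int = 100) -> tuple[int, int]:
    """Express pixel coordinate x as (nearest ruler index, signed offset)."""
    idx = round(x / spacing)
    return idx, x - idx * spacing

def ruler_decode(idx: int, offset: int, spacing: int = 100) -> int:
    """Recover the pixel coordinate from a ruler index and offset."""
    return idx * spacing + offset

# A click at x=437 becomes "RULER_4 + 37" rather than the literal 437.
idx, off = ruler_encode(437)
assert (idx, off) == (4, 37)
assert ruler_decode(idx, off) == 437
```

The decoding round-trips exactly for any coordinate, and because the offsets stay small regardless of screen size, the scheme degrades more gracefully on resolutions unseen during training.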

Why it matters?

This work is important because it makes GUI automation more reliable. If computers can accurately interact with GUIs, it opens the door to more helpful and independent robots and virtual assistants. The improvements are especially significant for high-resolution screens, meaning the technology will work well on modern devices and interfaces.

Abstract

GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet remains difficult for current VLMs. The core bottleneck is reliable patch-to-pixel mapping, which breaks when extrapolating to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures proliferate on new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions similar to gridlines on a map and adjust rather than generate coordinates from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that width and height dimensions are represented equally, addressing the asymmetry of standard positional schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.
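The width/height asymmetry the abstract describes can be sketched numerically. In a standard multimodal rotary scheme, each frequency pair in the embedding is assigned to one spatial axis; assigning them block-wise gives one axis only the high frequencies and the other only the low ones, while interleaving alternates the axes so both span the full frequency range. Everything below is an illustrative toy (the function names are not from the paper):

```python
import numpy as np

def rope_freqs(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard rotary frequencies: one per pair of embedding dimensions."""
    return base ** (-np.arange(0, dim, 2) / dim)

def blockwise_axes(n_pairs: int) -> np.ndarray:
    """Block-wise assignment: first half of pairs -> height (0), rest -> width (1)."""
    return np.array([0] * (n_pairs // 2) + [1] * (n_pairs - n_pairs // 2))

def interleaved_axes(n_pairs: int) -> np.ndarray:
    """I-MRoPE-style assignment: alternate height/width across frequency pairs."""
    return np.arange(n_pairs) % 2

def angles(pos_hw: tuple[int, int], axes: np.ndarray, freqs: np.ndarray) -> np.ndarray:
    """Rotation angle per pair: pick the h or w position for each pair's axis."""
    pos = np.where(axes == 0, pos_hw[0], pos_hw[1])
    return pos * freqs

freqs = rope_freqs(8)                      # 4 frequency pairs
print(blockwise_axes(4))                   # [0 0 1 1]: height only sees high freqs
print(interleaved_axes(4))                 # [0 1 0 1]: both axes span all scales
print(angles((2, 3), interleaved_axes(4), freqs))
```

Under the block-wise assignment, height and width positions are rotated by disjoint frequency bands, so the two axes are encoded at different spatial scales; interleaving gives each axis the same mix of coarse and fine frequencies, which matches the paper's goal of representing width and height symmetrically.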