UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Quyu Kong, Xu Zhang, Chen Liu, Yuqi Liu, Wenxuan Wang, Yue Wang, Qin Jin, Steven Hoi
2025-10-27
Summary
This paper focuses on improving how AI agents understand and follow instructions when interacting with computer interfaces, like apps on your phone or programs on your computer.
What's the problem?
Current AI agents often struggle to follow instructions because they treat every instruction as equally good, even when some are poorly worded or ambiguous. The researchers audited existing grounding datasets and found that over 23% of their instructions are flawed, which means the agents are learning from imperfect examples. These agents also fail to take advantage of the fact that the same target can be described in many different ways, and that this diversity can help them figure out what to do.
What's the solution?
The researchers introduced a new approach called 'Instruction-as-Reasoning'. Instead of treating an instruction as a single fixed command, the AI treats instructions as different ways to *think* about solving a problem. Training happens in two stages: first, the model is shown many synthesized phrasings of the same request, so it learns to see a task from multiple perspectives. Then, a reward system teaches it to choose the best 'thinking pathway' (the most helpful instruction) when it needs to act. This resulted in two new models, UI-Ins-7B and UI-Ins-32B, that are better at grounding instructions to the right on-screen elements.
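The reward-driven pathway selection can be sketched in a few lines. This is a rough illustration, not the paper's actual implementation: the function names, the binary click-in-box reward, and the idea of scoring each candidate instruction by its predicted click are all assumptions for the sake of the example.

```python
def grounding_reward(pred_click, bbox):
    """Binary grounding reward: 1.0 if the predicted click (x, y) lands
    inside the target element's bounding box (left, top, right, bottom),
    else 0.0. A reward like this is a common choice for grounding RL."""
    x, y = pred_click
    left, top, right, bottom = bbox
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def select_pathway(candidates, bbox):
    """Among candidate instruction 'pathways', return the one whose
    predicted click earns the highest reward. Each candidate here is an
    (instruction_text, predicted_click) pair; in the real system the
    click would come from the grounding model itself."""
    return max(candidates, key=lambda c: grounding_reward(c[1], bbox))
```

During training, a reward signal of this kind lets reinforcement learning favor whichever phrasing of the instruction actually leads the model to the correct element.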
Why it matters?
This work is important because it significantly improves the reliability and performance of AI agents that interact with software. The new models achieve state-of-the-art results on five grounding benchmarks, and the smaller model even reaches a 74.1% success rate completing tasks in a simulated Android environment (AndroidWorld). By teaching the AI to reason about instructions, the researchers have created agents that are more robust, adaptable, and capable of handling real-world tasks.
Abstract
GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a substantial 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives, enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released at https://github.com/alibaba/UI-Ins.