InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xueyu Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, Shengyu Zhang, Hongxia Yang, Fei Wu

2025-08-11

Summary

This paper introduces a new method called Adaptive Exploration Policy Optimization (AEPO), which helps Multimodal Large Language Models (MLLMs) better follow instructions for interacting with graphical user interfaces (GUIs). It improves how models map natural language instructions to the correct parts of the user interface.

What's the problem?

The problem is that existing methods can locate positions on the screen, but they struggle to link a natural language instruction to the exact functional element because their training explores too few candidate answers. This limited exploration causes semantic alignment failures: the model often cannot work out which UI element the instruction actually refers to.

What's the solution?

The paper presents AEPO, which encourages the model to propose several candidate answers at once through a multi-answer generation strategy. This broader exploration is balanced by an adaptive reward function that favors searches that are both wide and purposeful, so the model learns stronger associations between instructions and UI elements. AEPO-trained models set new state-of-the-art results on standard GUI grounding benchmarks, improving instruction-to-element matching by up to 9.0% over the standard reinforcement learning with verifiable rewards (RLVR) baseline.
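To make the idea concrete, here is a minimal sketch of an exploration-aware grounding reward. This is an illustration, not the paper's actual formula: the function name, the utility/cost split, and the exact scoring are all assumptions. The intuition it captures is the one described above: proposing several candidates is allowed, an early correct hit earns more, and spraying many guesses is penalized.

```python
def adaptive_exploration_reward(candidates, target_box):
    """Hypothetical sketch of an exploration-balanced grounding reward.

    candidates: ordered list of predicted (x, y) click points for one
        instruction (the model's multi-answer output, best guess first).
    target_box: (x0, y0, x1, y1) bounding box of the ground-truth element.
    Returns a reward in [0, 1]: utility of the first correct hit divided
    by the number of answers proposed, so wide but aimless guessing pays less.
    """
    x0, y0, x1, y1 = target_box
    # Utility: 1/k if the k-th candidate (1-indexed) is the first one
    # that lands inside the target element; 0 if no candidate hits.
    utility = 0.0
    for k, (x, y) in enumerate(candidates, start=1):
        if x0 <= x <= x1 and y0 <= y <= y1:
            utility = 1.0 / k
            break
    # Cost: how many answers were spent; dividing by it discourages
    # indiscriminate multi-answer spraying while still rewarding exploration.
    cost = max(len(candidates), 1)
    return utility / cost
```

Under this toy scoring, a single correct guess earns the full reward, while a correct second guess out of two candidates earns only a quarter of it, which is the kind of pressure that keeps exploration purposeful rather than scattershot.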

Why it matters?

This matters because improving how AI understands and interacts with GUIs through natural language can lead to more intelligent and useful autonomous agents. These agents could operate software visually just by following instructions, which would make interacting with complex systems faster, more accurate, and accessible for people without technical skills.

Abstract

Adaptive Exploration Policy Optimization (AEPO) enhances semantic alignment in Multimodal Large Language Models (MLLMs) for GUI interaction, improving performance on benchmarks by up to 9.0% compared to RLVR.