GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

Haolong Yan, Yeqing Shen, Xin Huang, Jia Wang, Kaijun Tan, Zhixuan Liang, Hongxin Li, Zheng Ge, Osamu Yoshie, Si Li, Xiangyu Zhang, Daxin Jiang

2025-12-03

Summary

This paper introduces GUI Exploration Lab, a simulated environment designed to help researchers build and test AI agents that operate computer and phone interfaces, such as apps and desktop software.

What's the problem?

Developing AI agents that can effectively navigate and use complex software or apps is difficult because it is hard to obtain enough information about how these programs actually work for training purposes. Real-world programs are often complicated, and their inner workings are not publicly available, making it tough to create realistic training data and to test how well an agent handles new situations.

What's the solution?

The researchers created GUI Exploration Lab, an engine for building custom, simulated interfaces. Because the environment is fully under their control, they can see everything the agent sees and track every action it takes. They then trained agents in stages: first showing the agent examples of how to do things (supervised fine-tuning), then letting it learn through trial and error with rewards (reinforcement learning), starting with simple single-turn tasks and moving to complex, multi-step interactions.
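To make the idea of a fully observable, simulated GUI concrete, here is a minimal sketch of such an environment: screens are nodes, icons are labeled edges, and the agent navigates by "clicking" icons. The class names, methods, and the example graph are illustrative assumptions, not the paper's actual implementation.

```python
import random


class GuiNavEnv:
    """Toy screen-navigation environment: a navigation graph of screens and icons."""

    def __init__(self, nav_graph, start, target):
        # nav_graph: {screen: {icon_name: next_screen}}
        self.nav_graph = nav_graph
        self.start = start
        self.target = target
        self.reset()

    def reset(self):
        self.current = self.start
        return self.observe()

    def observe(self):
        # Full observability: the current screen and its clickable icons.
        return {"screen": self.current,
                "icons": sorted(self.nav_graph[self.current])}

    def step(self, icon):
        # Clicking an icon transitions to the linked screen.
        self.current = self.nav_graph[self.current][icon]
        done = self.current == self.target
        reward = 1.0 if done else 0.0  # sparse reward on reaching the target
        return self.observe(), reward, done


def random_rollout(env, max_turns=10, seed=0):
    """Multi-turn trial and error with a random policy (a stand-in for the agent)."""
    rng = random.Random(seed)
    obs = env.reset()
    for turn in range(1, max_turns + 1):
        action = rng.choice(obs["icons"])
        obs, reward, done = env.step(action)
        if done:
            return turn, reward
    return max_turns, 0.0


# A tiny phone-like navigation graph: reach the wifi screen from home.
graph = {
    "home": {"settings": "settings", "camera": "camera"},
    "settings": {"wifi": "wifi", "back": "home"},
    "camera": {"back": "home"},
    "wifi": {"back": "settings"},
}
env = GuiNavEnv(graph, start="home", target="wifi")
turns, reward = random_rollout(env, max_turns=20)
```

Because the researcher defines the graph, every screen, icon, and transition is known in advance, which is exactly the kind of "full access to environment information" a proprietary real-world app cannot offer.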

Why it matters?

This work is important because it provides a standardized way to test and improve AI agents for GUI automation. By creating a controlled environment, researchers can systematically study what works best for teaching agents to navigate and interact with software, ultimately leading to more capable and helpful AI assistants for everyday tasks.

Abstract

With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks has shifted from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.
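The abstract's third stage, multi-turn reinforcement learning through interactive trial and error, can be illustrated with ordinary tabular Q-learning on a toy navigation graph. This is a generic RL sketch under assumed names and hyperparameters, not the paper's actual algorithm, environment, or reward design.

```python
import random

# Toy navigation graph: screens map icon names to the screens they open.
GRAPH = {
    "home": {"settings": "settings", "camera": "camera"},
    "settings": {"wifi": "wifi", "back": "home"},
    "camera": {"back": "home"},
    "wifi": {},  # terminal target screen
}
START, TARGET = "home", "wifi"


def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Multi-turn episodes: explore by clicking icons, learn from sparse reward."""
    rng = random.Random(seed)
    q = {}  # (screen, icon) -> estimated value
    for _ in range(episodes):
        screen = START
        for _ in range(10):  # one multi-turn episode, capped at 10 clicks
            icons = list(GRAPH[screen])
            if not icons:
                break  # terminal screen reached
            # epsilon-greedy: mostly exploit, sometimes explore
            if rng.random() < eps:
                icon = rng.choice(icons)
            else:
                icon = max(icons, key=lambda a: q.get((screen, a), 0.0))
            nxt = GRAPH[screen][icon]
            reward = 1.0 if nxt == TARGET else 0.0
            future = max((q.get((nxt, a), 0.0) for a in GRAPH[nxt]), default=0.0)
            old = q.get((screen, icon), 0.0)
            q[(screen, icon)] = old + alpha * (reward + gamma * future - old)
            screen = nxt
    return q


def greedy_path(q, max_turns=10):
    """Follow the learned values greedily from the start screen."""
    path, screen = [START], START
    for _ in range(max_turns):
        icons = list(GRAPH[screen])
        if not icons:
            break
        icon = max(icons, key=lambda a: q.get((screen, a), 0.0))
        screen = GRAPH[screen][icon]
        path.append(screen)
        if screen == TARGET:
            break
    return path


q = train()
path = greedy_path(q)
```

After training, the greedy policy takes the direct route home → settings → wifi rather than detouring through the camera screen, mirroring how trial-and-error interaction lets an agent discover efficient navigation strategies.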