UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, Hongsheng Li

2025-03-28

Summary

This paper improves how AI models predict which actions to take on phone interfaces, using a reinforcement learning technique with simple rule-based rewards inspired by recent breakthroughs such as DeepSeek-R1.

What's the problem?

AI models often struggle to understand and interact with phone interfaces, making it difficult for them to automate tasks or assist users effectively.

What's the solution?

The researchers trained a model, UI-R1-3B, with reinforcement learning: instead of relying on large amounts of supervised training data, they used a small set of 136 challenging mobile tasks and simple rule-based rewards that check whether the model picked the right type of action and the right spot on the screen, letting it improve its action predictions with far less data.

Why it matters?

This work matters because it can lead to more helpful and efficient AI assistants for mobile devices, making it easier for people to use their phones and automate tasks.

Abstract

The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Building on this idea, we are the first to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for graphical user interface (GUI) action prediction tasks. To this end, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. We also introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our proposed data-efficient model, UI-R1-3B, achieves substantial improvements on both in-domain (ID) and out-of-domain (OOD) tasks. Specifically, on the ID benchmark AndroidControl, the action type accuracy improves by 15%, while grounding accuracy increases by 10.3%, compared with the base model (i.e., Qwen2.5-VL-3B). On the OOD GUI grounding benchmark ScreenSpot-Pro, our model surpasses the base model by 6.0% and achieves competitive performance with larger models (e.g., OS-Atlas-7B), which are trained via supervised fine-tuning (SFT) on 76K samples. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain.
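
To make the "unified rule-based action reward" and GRPO-style optimization described above more concrete, here is a minimal Python sketch. It is not the authors' released code: the function names, the think/answer tag format, and the equal weighting of the reward terms are illustrative assumptions based only on the abstract's description.

```python
import re
import statistics

# Hypothetical sketch (not the authors' implementation) of a unified
# rule-based action reward plus a GRPO-style group-relative advantage.

def format_reward(response: str) -> float:
    """1.0 if the response follows an assumed think/answer format, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def action_type_reward(pred_type: str, gold_type: str) -> float:
    """Exact match on the predicted action type (e.g. click, scroll, input text)."""
    return 1.0 if pred_type == gold_type else 0.0

def grounding_reward(pred_xy, gold_bbox) -> float:
    """For click-style actions: 1.0 if the predicted point lies inside the target box."""
    x, y = pred_xy
    x1, y1, x2, y2 = gold_bbox
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def unified_action_reward(response, pred_type, gold_type,
                          pred_xy=None, gold_bbox=None) -> float:
    """Sum the rule-based terms; the grounding term applies only when a box exists."""
    reward = format_reward(response) + action_type_reward(pred_type, gold_type)
    if pred_xy is not None and gold_bbox is not None:
        reward += grounding_reward(pred_xy, gold_bbox)
    return reward

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled response's reward against
    the mean/std of its group, so no learned value model is required."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: score a group of sampled responses for one GUI task and turn
# the rewards into advantages for the policy update.
if __name__ == "__main__":
    rewards = [2.0, 3.0, 1.0, 3.0]
    print(group_relative_advantages(rewards))
```

Because every reward term is a deterministic rule (format check, type match, point-in-box test), no separate learned reward model is needed, which is what makes this kind of training data-efficient.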