VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model

Jiani Zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

2025-02-27

Summary

This paper introduces a new way to train AI to interact with computer interfaces (like apps on your phone) without needing thousands of practice runs on real apps. The researchers created a system called VEM that learns how to use apps by studying data from how humans use them.

What's the problem?

Teaching AI to use apps and websites is tricky. The old way required the AI to practice on real apps many times, which is slow and expensive. Other methods that don't use real apps often make mistakes when faced with new or changed interfaces.

What's the solution?

The researchers created a system called VEM (Value Environment Model). VEM learns from recordings of humans using apps, so it can judge which actions are helpful for a given task. It doesn't need to predict exactly what will happen next on the screen, only whether an action is good or bad for reaching the goal. The researchers then use this trained VEM as a fixed guide that teaches an AI agent how to use apps, without the agent ever practicing on the real apps.
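The idea of "score actions instead of predicting the next screen" can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the real VEM is a vision-language model scoring (screenshot, goal, action) triples, whereas `value_of` below is a hypothetical word-overlap stand-in.

```python
def value_of(screen: str, goal: str, action: str) -> float:
    """Hypothetical stand-in for VEM: estimate how much `action`
    advances `goal` on the current `screen` (toy word-overlap score)."""
    goal_words = set(goal.lower().split())
    action_words = set(action.lower().split())
    return len(goal_words & action_words) / max(len(goal_words), 1)

def pick_action(screen: str, goal: str, candidates: list[str]) -> str:
    """Choose the candidate the value model rates highest -- note that
    no environment step or next-screen prediction is required."""
    return max(candidates, key=lambda a: value_of(screen, goal, a))

best = pick_action(
    screen="settings page",
    goal="turn on airplane mode",
    candidates=["tap airplane mode toggle", "open bluetooth menu", "scroll down"],
)
```

Because the scorer only asks "is this action useful for the goal?", it doesn't break when a button moves or the layout changes, which is the intuition behind VEM's resilience to UI updates.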

Why it matters?

This matters because it could make it much faster and cheaper to create AI that can help people use complex software or apps. It could lead to better virtual assistants that can actually do tasks for you on your phone or computer. The method is also more flexible, so the AI can adapt to app updates or new interfaces more easily. This could help make technology more accessible to people who struggle with complicated interfaces.

Abstract

Training Vision-Language Models (VLMs) for Graphical User Interfaces (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., Does this action advance the user's goal?). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings, outperforming environment-free baselines significantly and matching environment-based approaches without interaction costs. Importantly, VEM demonstrates that semantic-aware value estimation can achieve comparable performance with online-trained methods.
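The two stages in the abstract can be sketched as follows. Everything here is a toy stand-in under stated assumptions: stage 1 fits a tiny linear value model to offline (features, return) pairs in place of pretraining a VLM on GUI trajectories, and stage 2 runs a simple policy-gradient update on a two-action softmax policy using the frozen value model's scores as the learning signal, with no environment rollouts.

```python
import math
import random

random.seed(0)

# Stage 1: fit a tiny value model on offline (features, return) pairs.
# The data and features are toy stand-ins for offline GUI trajectories.
offline = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0),
           ((1.0, 1.0), 1.0), ((0.0, 0.0), 0.0)]
w = [0.0, 0.0]
for _ in range(200):
    for x, ret in offline:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - ret
        w = [wi - 0.1 * err * xi for wi, xi in zip(w, x)]

def vem(x):
    """Frozen after stage 1: predicts the long-term utility of an action."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Stage 2: improve a softmax policy over two candidate actions using only
# the frozen VEM's scores as reward signal -- no environment interaction.
theta = [0.0, 0.0]
actions = [(1.0, 0.0), (0.0, 1.0)]  # toy action features
for _ in range(100):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    probs = [e / z for e in exps]
    i = random.choices(range(2), probs)[0]
    # Advantage: VEM score of the sampled action minus the policy's mean.
    advantage = vem(actions[i]) - sum(p * vem(a) for p, a in zip(probs, actions))
    # Policy-gradient step: raise the probability of high-value actions.
    for j in range(2):
        grad = (1.0 if j == i else 0.0) - probs[j]
        theta[j] += 0.5 * advantage * grad
```

After training, the policy concentrates on the action the value model rates highly. The key property being illustrated is the decoupling the abstract describes: the value model is learned once from offline data, then frozen, so policy exploration never pays the cost of real app interactions.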