DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar

2024-06-21

Summary

This paper presents DigiRL, a new method for training digital agents that can control devices in real-world environments using reinforcement learning (RL).

What's the problem?

Vision-language models used to control devices like smartphones and tablets tend to perform poorly because their training data contains very little decision-making data. Traditional approaches fine-tune on static demonstrations, which do not prepare the models for the randomness and constant change of real-world tasks. As a result, when these models face actual graphical user interfaces (GUIs), they struggle to complete tasks reliably.

What's the solution?

The researchers developed DigiRL, which trains device-control agents in two stages: first, offline reinforcement learning initializes the model from previously collected data, and then offline-to-online reinforcement learning refines it through live interaction. To support this, they built a scalable, parallelizable Android learning environment with a VLM-based evaluator, so the agent can learn from its own interactions in real time. The training itself uses advantage-weighted RL, with advantage estimators adjusted to handle real-world randomness and an automatic curriculum that focuses the agent on the tasks it can learn the most from. With this recipe, DigiRL raised the task success rate from 17.7% to 67.2% compared to supervised fine-tuning on static human demonstrations, outperforming the previous best models.
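To make the core update concrete, here is a minimal sketch of an advantage-weighted RL loss of the kind the paper describes, assuming a PyTorch-style policy with a `log_prob` method and a separate learned value function; the names `policy`, `value_fn`, `beta`, and `max_weight` are illustrative placeholders, not the authors' actual code.

```python
# Minimal sketch of an advantage-weighted policy update (assumed PyTorch-style API).
# `policy` and `value_fn` stand in for the VLM policy and a learned value head.
import torch

def awr_loss(policy, value_fn, states, actions, returns, beta=1.0, max_weight=20.0):
    """Weight the log-likelihood of taken actions by exp(advantage / beta)."""
    values = value_fn(states).detach()                 # baseline V(s), no gradient
    advantages = returns - values                      # A(s, a) = R - V(s)
    weights = torch.exp(advantages / beta).clamp(max=max_weight)  # cap exploding weights
    log_probs = policy.log_prob(states, actions)       # log pi(a | s)
    return -(weights * log_probs).mean()               # maximize weighted likelihood
```

The exponential weighting pushes the policy toward actions that did better than expected, while the cap on the weights keeps a few noisy, high-advantage steps from dominating the update.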

Why it matters?

This research is important because it establishes a new standard for training digital agents that can effectively control devices in real-world scenarios. By improving how these agents learn and adapt, we can create more reliable and efficient AI systems for everyday tasks, making technology easier and more intuitive for users.

Abstract

Training corpuses for vision language models (VLMs) typically lack sufficient amounts of decision-centric data. This renders off-the-shelf VLMs sub-optimal for decision-making tasks such as in-the-wild device control through graphical user interfaces (GUIs). While training with static demonstrations has shown some promise, we show that such methods fall short for controlling real GUIs due to their failure to deal with real-world stochasticity and non-stationarity not captured in static observational data. This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents through fine-tuning a pre-trained VLM in two stages: offline RL to initialize the model, followed by offline-to-online RL. To do this, we build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator and develop a simple yet effective RL approach for learning in this domain. Our approach runs advantage-weighted RL with advantage estimators enhanced to account for stochasticity along with an automatic curriculum for deriving maximal learning signal. We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement -- from 17.7 to 67.2% success rate -- over supervised fine-tuning with static human demonstration data. These results significantly surpass not only the prior best agents, including AppAgent with GPT-4V (8.3% success rate) and the 17B CogAgent trained with AitW data (38.5%), but also the prior best autonomous RL approach based on filtered behavior cloning (57.8%), thereby establishing a new state-of-the-art for digital agents for in-the-wild device control.
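As a rough illustration of the two-stage recipe and the automatic curriculum described in the abstract, the sketch below shows how offline initialization and online interaction with a VLM-based evaluator could fit together. All names here (`update_fn`, `collect_fn`, `evaluate_fn`, and so on) are assumptions for illustration, not the actual DigiRL implementation.

```python
# Hedged sketch of a two-stage offline -> offline-to-online RL loop with a simple
# automatic curriculum. The environment interaction, VLM-based evaluation, and
# advantage-weighted update are passed in as callables to keep the sketch self-contained.
import random

def train(policy, offline_batches, tasks, update_fn, collect_fn, evaluate_fn,
          online_steps=1000, ema=0.1):
    # Stage 1: offline RL -- initialize the policy from static trajectories.
    for batch in offline_batches:
        update_fn(policy, batch)

    # Stage 2: offline-to-online RL -- interact, evaluate, update.
    success = {t: 0.0 for t in tasks}            # running per-task success estimate
    for _ in range(online_steps):
        # Automatic curriculum: prefer tasks with low estimated success,
        # where the learning signal is largest.
        weights = [1.0 - success[t] + 1e-3 for t in tasks]
        task = random.choices(tasks, weights=weights, k=1)[0]
        rollout = collect_fn(policy, task)        # roll out in the Android environment
        reward = evaluate_fn(rollout)             # VLM-based evaluator: 1.0 if task succeeded
        success[task] = (1 - ema) * success[task] + ema * reward
        update_fn(policy, rollout)                # advantage-weighted RL update on fresh data
```

Sampling tasks in proportion to their estimated failure rate is one simple way to realize an automatic curriculum: the agent spends its interaction budget where it still has the most to learn.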