
Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

Max Wilcoxson, Qiyang Li, Kevin Frans, Sergey Levine

2024-10-28


Summary

This paper introduces SUPE, an approach that helps reinforcement learning (RL) agents explore their environments more efficiently by reusing trajectory data from previous experience that has no reward labels.

What's the problem?

In reinforcement learning, agents learn by trying out actions and observing the outcomes, but they often struggle to explore effectively, especially in long-horizon tasks where rewards are sparse. Large amounts of prior trajectory data are often available, but they typically lack reward labels for the task at hand, and it has been unclear how to use such unlabeled data to make online exploration faster rather than simply ignoring it.

What's the solution?

The authors introduce a method called SUPE (Skills from Unlabeled Prior data for Exploration), which uses unlabeled data collected from previous interactions to improve the exploration strategies of RL agents. First, they extract low-level skills from this prior data by training a variational autoencoder (VAE) over short trajectory segments. Then, they pseudo-relabel these past experiences with optimistic reward estimates, turning them into high-level, task-relevant examples. Finally, a high-level policy is trained with online RL, using both the relabeled prior data and fresh experience as off-policy data, so that it learns to compose the pretrained low-level skills and explore efficiently.
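To make the first step concrete, here is a minimal sketch (not the authors' code) of pretraining low-level skills with a trajectory VAE: an encoder maps a short (state, action) segment to a latent skill, and a decoder acts as a low-level policy that reconstructs the actions from the state and the skill. Names such as SkillEncoder, SkillDecoder, segment_len, and the KL weight beta are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of low-level skill extraction with a trajectory VAE.
import torch
import torch.nn as nn

segment_len, obs_dim, act_dim, z_dim = 8, 17, 6, 8  # assumed sizes

class SkillEncoder(nn.Module):
    """Encodes a short (state, action) segment into a latent skill distribution q(z | segment)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(segment_len * (obs_dim + act_dim), 256), nn.ReLU(),
            nn.Linear(256, 2 * z_dim),  # mean and log-variance of q(z | segment)
        )
    def forward(self, seg):
        mu, log_var = self.net(seg.flatten(1)).chunk(2, dim=-1)
        return mu, log_var

class SkillDecoder(nn.Module):
    """Low-level policy pi(a | s, z): reconstructs actions from the state and the skill."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + z_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )
    def forward(self, obs, z):
        return self.net(torch.cat([obs, z], dim=-1))

encoder, decoder = SkillEncoder(), SkillDecoder()
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=3e-4)
beta = 0.1  # KL weight, assumed

def vae_loss(obs_seg, act_seg):
    # obs_seg: (B, segment_len, obs_dim), act_seg: (B, segment_len, act_dim)
    mu, log_var = encoder(torch.cat([obs_seg, act_seg], dim=-1))
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()            # reparameterization trick
    pred = decoder(obs_seg, z.unsqueeze(1).expand(-1, segment_len, -1))
    recon = ((pred - act_seg) ** 2).mean()                           # action reconstruction error
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).mean()     # KL to unit Gaussian prior
    return recon + beta * kl
```

After pretraining, the decoder is kept as a frozen library of low-level behaviors, and the encoder is what lets prior segments be mapped back to the skill that best explains them.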

Why it matters?

This research is important because it shows how leveraging unlabeled prior data can significantly enhance the learning process for RL agents. By improving exploration strategies, SUPE can help agents solve complex tasks more quickly and efficiently, which is valuable for developing smarter AI systems that can operate in real-world scenarios.

Abstract

Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-relabels unlabeled trajectories using an optimistic reward model, transforming prior data into high-level, task-relevant examples. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks. Code: https://github.com/rail-berkeley/supe.
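The pseudo-relabeling step described in the abstract can be illustrated with a hedged sketch: an ensemble reward model scores prior segments optimistically (mean prediction plus an uncertainty bonus), and each segment becomes a high-level transition (first state, inferred skill, optimistic return, last state) for off-policy RL. The ensemble size, the bonus weight kappa, and the helper names are assumptions made for illustration, not details from the released code.

```python
# Hypothetical sketch of optimistic pseudo-relabeling of unlabeled prior segments.
import torch
import torch.nn as nn

obs_dim, z_dim, n_models, kappa, gamma = 17, 8, 5, 1.0, 0.99  # assumed values

reward_ensemble = nn.ModuleList(
    nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, 1))
    for _ in range(n_models)
)  # each head would be fit to whatever task-labeled reward data is available online

def optimistic_reward(obs):
    """Mean prediction plus an uncertainty bonus across the ensemble."""
    preds = torch.stack([head(obs) for head in reward_ensemble], dim=0)
    return preds.mean(0) + kappa * preds.std(0)

def make_high_level_transition(obs_seg, act_seg, skill_encoder):
    """Turn one unlabeled prior segment into a high-level off-policy transition."""
    with torch.no_grad():
        # Infer which pretrained skill explains this segment.
        mu, _ = skill_encoder(torch.cat([obs_seg, act_seg], dim=-1))
        # Optimistically relabel per-step rewards and sum them over the segment.
        rewards = optimistic_reward(obs_seg)                              # (B, segment_len, 1)
        steps = torch.arange(obs_seg.shape[1], dtype=torch.float32)
        ret = ((gamma ** steps).view(1, -1, 1) * rewards).sum(dim=1)      # discounted segment return
    # (first state, inferred skill, optimistic return, last state) feeds the
    # high-level policy's replay buffer alongside fresh online data.
    return obs_seg[:, 0], mu, ret, obs_seg[:, -1]
```

The optimism bonus is what makes the relabeled prior data useful for exploration: segments whose rewards the agent is uncertain about are scored higher, steering the high-level policy toward under-explored regions.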