RL Zero: Zero-Shot Language to Behaviors without any Supervision

Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum

2024-12-09

Summary

This paper introduces RLZero, a method that lets AI agents understand and perform tasks specified in natural language without any task-specific training or labeled supervision.

What's the problem?

In reinforcement learning, rewards are a difficult way to specify tasks because humans usually cannot predict what behavior a given reward function will actually produce. This often leads to poorly designed rewards and to agents exploiting loopholes in them (reward hacking). In addition, previous methods for teaching agents through language required large amounts of labeled data, which is expensive and time-consuming to create.

What's the solution?

The authors propose RLZero, which follows a three-step process: imagine, project, and imitate. First, a video-language model imagines what completing the task would look like, generating a sequence of observations from the language instruction. Then, the method projects this imagined sequence onto the agent's own domain by matching it to real observations collected during unsupervised exploration. Finally, the agent imitates the grounded observation sequence using a closed-form imitation step on top of its pre-trained representations, without any task-specific training, as sketched below. This lets the agent produce behavior for a new task directly from its language description, with no extra reward design or labels.
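The sketch below illustrates the imagine, project, and imitate loop in simplified Python. Everything here is a stand-in (random embeddings, nearest-neighbour matching, an averaged task vector); the actual method relies on a video-language model and a pre-trained unsupervised RL agent, and none of these function names come from the paper.

```python
# Minimal, self-contained sketch of the imagine / project / imitate loop.
# All components are illustrative stand-ins, not the paper's actual API.

import numpy as np

def imagine(task_description: str, num_frames: int = 8, dim: int = 32) -> np.ndarray:
    """Stand-in for a video-language model: returns embeddings of imagined frames."""
    rng = np.random.default_rng(abs(hash(task_description)) % (2**32))
    return rng.normal(size=(num_frames, dim))

def project(imagined: np.ndarray, replay_embeddings: np.ndarray) -> np.ndarray:
    """Ground each imagined frame to the nearest real observation the agent has seen."""
    # Pairwise distances between imagined frames and replay-buffer observations.
    dists = np.linalg.norm(imagined[:, None, :] - replay_embeddings[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    return replay_embeddings[nearest]

def imitate(grounded: np.ndarray) -> np.ndarray:
    """Stand-in for the closed-form imitation step: summarise the grounded
    observation sequence into a single task vector that conditions the
    pre-trained zero-shot policy."""
    return grounded.mean(axis=0)

# Usage: embeddings of observations collected during unsupervised exploration.
replay_embeddings = np.random.default_rng(0).normal(size=(1000, 32))
task_vector = imitate(project(imagine("run forward"), replay_embeddings))
print(task_vector.shape)  # (32,) -- conditions the agent's policy at test time
```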

Why it matters?

This research matters because it simplifies how we teach agents complex tasks using everyday language. By removing the need for hand-designed rewards and costly labeling, RLZero makes agents more adaptable and easier to direct in applications such as robotics and personal assistants.

Abstract

Rewards remain an uninterpretable way to specify tasks for Reinforcement Learning, as humans are often unable to predict the optimal behavior of any given reward function, leading to poor reward design and reward hacking. Language presents an appealing way to communicate intent to agents and bypass reward design, but prior efforts to do so have been limited by costly and unscalable labeling efforts. In this work, we propose a method for a completely unsupervised alternative to grounding language instructions in a zero-shot manner to obtain policies. We present a solution that takes the form of imagine, project, and imitate: The agent imagines the observation sequence corresponding to the language description of a task, projects the imagined sequence to our target domain, and grounds it to a policy. Video-language models allow us to imagine task descriptions that leverage knowledge of tasks learned from internet-scale video-text mappings. The challenge remains to ground these generations to a policy. In this work, we show that we can achieve a zero-shot language-to-behavior policy by first grounding the imagined sequences in real observations of an unsupervised RL agent and using a closed-form solution to imitation learning that allows the RL agent to mimic the grounded observations. Our method, RLZero, is the first to our knowledge to show zero-shot language to behavior generation abilities without any supervision on a variety of tasks on simulated domains. We further show that RLZero can also generate policies zero-shot from cross-embodied videos such as those scraped from YouTube.
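The abstract mentions a closed-form solution to imitation learning. As a hedged illustration, the snippet below shows what such a step can look like if the unsupervised agent is pre-trained with forward-backward (successor-measure-style) representations, where a task vector can be read off as the average backward embedding of the observations to imitate; this specific formulation is an assumption made here for illustration, not necessarily the paper's exact construction.

```python
import numpy as np

def infer_task_vector(backward_embeddings: np.ndarray) -> np.ndarray:
    """Closed-form imitation under an assumed forward-backward representation:
    the task vector z is the (normalised) mean backward embedding B(s) of the
    grounded observations -- no gradient-based fine-tuning is needed."""
    z = backward_embeddings.mean(axis=0)
    return z / (np.linalg.norm(z) + 1e-8)

# B(s) for the observations matched during the "project" step (dummy values here).
grounded_B = np.random.default_rng(1).normal(size=(8, 32))
z = infer_task_vector(grounded_B)
# The pre-trained policy pi(a | s, z) is then executed with this fixed task vector z.
```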