
Latent Action Pretraining from Videos

Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo

2024-10-15


Summary

This paper introduces Latent Action Pretraining for general Action models (LAPA), a method that lets robots learn from videos that have no action labels, making it easier to train them for a wide range of tasks.

What's the problem?

Most existing methods for training robot policies need data labeled with the exact actions the robot should take, typically collected by human teleoperators. Such labels are expensive to gather, which limits both the amount and the diversity of usable training data, so robots may not learn as effectively as they could from larger, more varied datasets.

What's the solution?

LAPA sidesteps this problem by learning from internet-scale videos that have no action labels. It first trains a quantization model to extract discrete latent actions from pairs of consecutive video frames, then pretrains a model to predict these latent actions from observations and task descriptions, and finally fine-tunes the model on a small amount of labeled robot data to map the latent actions to actual robot movements. This approach lets LAPA outperform existing methods that depend on labeled actions.
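
To make the first step concrete, here is a minimal sketch of a VQ-VAE-style latent action quantizer in PyTorch. Everything in it is illustrative: the module names (LatentActionQuantizer, pair_encoder), the 64x64 frame size, the codebook size, and the tiny convolutional networks are assumptions rather than the paper's actual architecture; only the overall recipe, encode a frame pair, snap it to a discrete code, and reconstruct the next frame, reflects the method described here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionQuantizer(nn.Module):
    """Maps the transition between two consecutive frames to a discrete latent action (sketch)."""
    def __init__(self, n_codes=256, code_dim=64):
        super().__init__()
        def conv_trunk(in_ch):
            # Tiny CNN that maps a 64x64 image stack to a code_dim vector (placeholder architecture).
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, code_dim),
            )
        self.pair_encoder = conv_trunk(6)    # encodes frame_t and frame_t+1 stacked on channels
        self.frame_encoder = conv_trunk(3)   # encodes frame_t alone as context for the decoder
        self.codebook = nn.Embedding(n_codes, code_dim)   # discrete latent "actions"
        self.decoder = nn.Sequential(        # reconstructs frame_t+1 from frame_t features + action code
            nn.Linear(2 * code_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * 64 * 64),
        )

    def forward(self, frame_t, frame_t1):
        z = self.pair_encoder(torch.cat([frame_t, frame_t1], dim=1))   # (B, code_dim)
        dists = torch.cdist(z, self.codebook.weight)                   # distance to every code
        idx = dists.argmin(dim=1)                                      # discrete latent action index
        z_q = self.codebook(idx)
        z_q_st = z + (z_q - z).detach()                                # straight-through estimator
        ctx = self.frame_encoder(frame_t)
        recon = self.decoder(torch.cat([ctx, z_q_st], dim=1)).view(frame_t1.shape)
        loss = (F.mse_loss(recon, frame_t1)                            # reconstruction term
                + F.mse_loss(z_q, z.detach())                          # codebook term
                + 0.25 * F.mse_loss(z, z_q.detach()))                  # commitment term
        return idx, loss

Once trained, the argmin indices act as pseudo action labels for the video frames, which is what the next stage of the pipeline learns to predict.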

Why it matters?

This research is significant because it opens up a much wider pool of training data: the vast amount of video already available online. By reducing reliance on labeled robot data, LAPA can help build more capable robots that learn and adapt to new tasks more efficiently, which is valuable for robotics, automation, and AI more broadly.

Abstract

We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging a VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for building robotics foundation models.
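
As a rough illustration of the second and third stages described in the abstract, the sketch below pairs a toy policy with the latent action labels produced by a quantizer like the one above. The ToyLatentVLA class, the 7-dimensional action space, the 128-dimensional text embedding, and the random tensors standing in for data are all hypothetical placeholders; the real LAPA model builds on a large pretrained vision-language backbone rather than this toy encoder.

import torch
import torch.nn as nn

class ToyLatentVLA(nn.Module):
    """Predicts a discrete latent action (and, after fine-tuning, a robot action) from an image and a text embedding."""
    def __init__(self, n_codes=256, text_dim=128, action_dim=7):
        super().__init__()
        self.vision = nn.Sequential(                     # placeholder image encoder
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
        )
        self.fuse = nn.Sequential(nn.Linear(128 + text_dim, 256), nn.ReLU())
        self.latent_head = nn.Linear(256, n_codes)       # stage 2: predict the latent action token
        self.action_head = nn.Linear(256, action_dim)    # stage 3: map to a continuous robot action

    def forward(self, image, text_emb):
        h = self.fuse(torch.cat([self.vision(image), text_emb], dim=1))
        return self.latent_head(h), self.action_head(h)

model = ToyLatentVLA()

# Stage 2 (pretraining on actionless video): supervise with latent actions from the quantizer.
logits, _ = model(torch.randn(4, 3, 64, 64), torch.randn(4, 128))
latent_targets = torch.randint(0, 256, (4,))             # indices the quantizer would produce
pretrain_loss = nn.functional.cross_entropy(logits, latent_targets)

# Stage 3 (fine-tuning on a small labeled robot dataset): regress real robot actions.
_, pred_actions = model(torch.randn(4, 3, 64, 64), torch.randn(4, 128))
true_actions = torch.randn(4, 7)                         # e.g. end-effector deltas plus gripper state
finetune_loss = nn.functional.mse_loss(pred_actions, true_actions)

The point of the split is that the shared representation is shaped almost entirely by pretraining on actionless video through the latent-action head, so only a small labeled dataset is needed to calibrate the action head to real robot commands.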