
Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset

Guangqi Jiang, Yifei Sun, Tao Huang, Huanyu Li, Yongyuan Liang, Huazhe Xu

2024-10-30


Summary

This paper introduces a method for pre-training robot manipulation representations on a large dataset of real robot interactions, so that robots learn from robot experience rather than from human videos.

What's the problem?

Training robots to perform tasks like picking up objects or assembling parts requires a lot of demonstration data. Large, high-quality robot-specific datasets are scarce, so prior work often pre-trains on human videos instead. But human videos don't match what robots actually see and do (a distribution shift), and they lack the movement and action information robots need, which can leave robots performing poorly on real-world tasks.

What's the solution?

The authors propose Manipulation Centric Representation (MCR), an approach that captures both visual features and the dynamics of manipulation, such as the robot's movements and actions. They pre-train a visual encoder on DROID, a large-scale dataset of real robot interactions, using training objectives that align visual observations with the robot's proprioceptive states and actions: a contrastive alignment loss, a behavior-cloning-style action prediction loss, and a time contrastive loss. Their experiments show that this method significantly outperforms previous pre-trained representations across a range of simulated and real-world robotic tasks.
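To make the idea concrete, here is a minimal, hypothetical sketch of this kind of pre-training objective: an InfoNCE-style contrastive loss that pulls an image embedding toward the embedding of the matching proprioceptive state and action, plus a behavior-cloning head that predicts the action from the image embedding. The encoder architectures, dimensions, and loss weighting are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCRSketch(nn.Module):
    def __init__(self, img_dim=512, state_dim=7, action_dim=7, embed_dim=128):
        super().__init__()
        # Stand-in visual encoder (the paper pre-trains a real image backbone on DROID).
        self.visual_encoder = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(),
                                            nn.Linear(256, embed_dim))
        # Encoder for proprioceptive state + action (the "dynamics" side of the alignment).
        self.dyn_encoder = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                                         nn.Linear(256, embed_dim))
        # BC-like actor head that predicts the action from the visual embedding.
        self.actor = nn.Linear(embed_dim, action_dim)

    def forward(self, img_feat, state, action):
        z_img = F.normalize(self.visual_encoder(img_feat), dim=-1)
        z_dyn = F.normalize(self.dyn_encoder(torch.cat([state, action], dim=-1)), dim=-1)
        return z_img, z_dyn, self.actor(z_img)

def info_nce(z_a, z_b, temperature=0.1):
    # Symmetric InfoNCE: matching (image, dynamics) pairs in the batch are positives,
    # all other pairings serve as negatives.
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy batch of pre-extracted image features, robot states, and actions.
model = MCRSketch()
img_feat = torch.randn(32, 512)
state, action = torch.randn(32, 7), torch.randn(32, 7)
z_img, z_dyn, pred_action = model(img_feat, state, action)
loss = info_nce(z_img, z_dyn) + F.mse_loss(pred_action, action)  # alignment + BC terms
loss.backward()
```

In a real setup the image features would come from a convolutional or ViT backbone trained end to end, and the loss terms would be weighted; the point here is only how visual and dynamics embeddings are aligned in a shared space.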

Why it matters?

This research is important because it helps improve how robots learn to interact with the world around them, making them more effective in performing complex tasks. By using their own experiences instead of relying on human-generated data, robots can become better at understanding and executing actions in real-life situations, which is crucial for applications in industries like manufacturing, healthcare, and service robotics.

Abstract

The pre-training of visual representations has enhanced the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representation. Despite their promising results, representations from human videos are inevitably subject to distribution shifts and lack the dynamics information crucial for task completion. We first evaluate various pre-trained representations in terms of their correlation to the downstream robotic manipulation tasks (i.e., manipulation centricity). Interestingly, we find that the "manipulation centricity" is a strong indicator of success rates when applied to downstream tasks. Drawing from these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework capturing both visual features and the dynamics information such as actions and proprioceptions of manipulation tasks to improve manipulation centricity. Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions. We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, combined with a behavior cloning (BC)-like actor loss to predict actions during pre-training, along with a time contrastive loss. Empirical results across 4 simulation domains with 20 tasks verify that MCR outperforms the strongest baseline method by 14.8%. Moreover, MCR boosts the performance of data-efficient learning with a UR5e arm on 3 real-world tasks by 76.9%. Project website: https://robots-pretrain-robots.github.io/.
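The abstract also mentions a time contrastive loss alongside the dynamics alignment and behavior-cloning terms. Below is a small, hypothetical sketch of such a term, in which embeddings of temporally close frames from the same trajectory are treated as positives and other frames in the batch as negatives; the sampling scheme and temperature are illustrative assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def time_contrastive_loss(z_t, z_t_plus, temperature=0.1):
    """z_t, z_t_plus: (B, D) embeddings of anchor frames and nearby future frames."""
    z_t = F.normalize(z_t, dim=-1)
    z_t_plus = F.normalize(z_t_plus, dim=-1)
    logits = z_t @ z_t_plus.t() / temperature                 # similarity of all pairs in the batch
    targets = torch.arange(z_t.size(0), device=z_t.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings standing in for visual encoder outputs.
z_t, z_t_plus = torch.randn(32, 128), torch.randn(32, 128)
print(time_contrastive_loss(z_t, z_t_plus))
```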