RVT-2: Learning Precise Manipulation from Few Demonstrations

Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, Dieter Fox

2024-06-17

Summary

This paper presents RVT-2, a multitask robotic manipulation model that performs a variety of 3D manipulation tasks from language instructions. Its focus is on learning new, high-precision skills from only a few demonstrations.

What's the problem?

Many existing robotic systems struggle to learn precise tasks, especially when they require high accuracy, like inserting plugs or stacking blocks. Previous models often needed a lot of training data and time to become effective, which is not practical for real-world applications where quick learning is essential.

What's the solution?

To address these challenges, the authors developed RVT-2, a multitask model that can learn real-world tasks from about ten demonstrations each. Through a combination of architectural and system-level improvements, it trains six times faster and runs inference two times faster than its predecessor, RVT. On the RLBench benchmark, RVT-2 raises the success rate from 65% to 82%, a new state of the art. It is also effective on a real robot, where it learns high-precision tasks such as picking up and inserting plugs.
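To put the 65% to 82% figure in perspective, the short sketch below computes the absolute gain, the relative gain, and the relative drop in failures. Only the two success rates come from the paper; the function name `compare_success_rates` is hypothetical, and the failure rate is assumed to be simply one minus the success rate.

```python
def compare_success_rates(old_rate: float, new_rate: float) -> dict:
    """Compare two success rates given as fractions strictly between 0 and 1.

    Illustrative helper only; it is not part of the RVT-2 codebase.
    """
    if not (0.0 < old_rate < 1.0) or not (0.0 <= new_rate <= 1.0):
        raise ValueError("rates must be fractions, with 0 < old_rate < 1")

    absolute_gain = new_rate - old_rate          # in fraction units
    relative_gain = absolute_gain / old_rate     # gain relative to the old success rate
    old_failure = 1.0 - old_rate                 # assumed: failure = 1 - success
    new_failure = 1.0 - new_rate
    failure_reduction = (old_failure - new_failure) / old_failure

    return {
        "absolute_gain": absolute_gain,
        "relative_gain": relative_gain,
        "failure_reduction": failure_reduction,
    }


# Success rates reported for RLBench: 65% (previous best) and 82% (RVT-2).
stats = compare_success_rates(0.65, 0.82)
print(f"absolute gain: {stats['absolute_gain'] * 100:.1f} percentage points")
print(f"relative gain: {stats['relative_gain'] * 100:.1f}%")
print(f"failure reduction: {stats['failure_reduction'] * 100:.1f}%")
```

Under these assumptions, the 17-point gain corresponds to roughly a 26% relative improvement in successes and close to half as many failed attempts.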

Why it matters?

This research is important because it shows how robots can learn complex tasks quickly and accurately with minimal examples. By enhancing the capabilities of robotic systems like RVT-2, we can make them more useful in both industrial settings and everyday life, helping with tasks that require precision and adaptability.

Abstract

In this work, we study how to build a robotic system that can solve multiple 3D manipulation tasks given language instructions. To be useful in industrial and household domains, such a system should be capable of learning new tasks with few demonstrations and solving them precisely. Prior works, like PerAct and RVT, have studied this problem; however, they often struggle with tasks requiring high precision. We study how to make them more effective, precise, and fast. Using a combination of architectural and system-level improvements, we propose RVT-2, a multitask 3D manipulation model that is 6X faster in training and 2X faster in inference than its predecessor RVT. RVT-2 achieves a new state-of-the-art on RLBench, improving the success rate from 65% to 82%. RVT-2 is also effective in the real world, where it can learn tasks requiring high precision, like picking up and inserting plugs, with just 10 demonstrations. Visual results, code, and trained model are provided at: https://robotic-view-transformer-2.github.io/.
