CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

Zhefei Gong, Pengxiang Ding, Shangke Lyu, Siteng Huang, Mingyang Sun, Wei Zhao, Zhaoxin Fan, Donglin Wang

2024-12-10

Summary

This paper introduces CARP, a new method for teaching robots to perform tasks by generating their action sequences in a two-stage, coarse-to-fine process that is both more accurate and faster than prior approaches.

What's the problem?

Existing methods for predicting robot actions trade accuracy for speed. Diffusion-based models generate accurate action trajectories but need many denoising steps to refine each prediction, which slows down inference, while traditional autoregressive models are faster but less accurate and adapt poorly to complex tasks that require flexibility.

What's the solution?

The authors introduce the Coarse-to-Fine AutoRegressive Policy (CARP), which splits action generation into two stages. First, an action autoencoder learns multi-scale representations of the entire action sequence, capturing its broad shape at coarse scales and its details at finer ones. Then, a GPT-style transformer predicts the sequence autoregressively scale by scale, starting from the coarsest representation and progressively refining it. This approach produces smooth and accurate actions while being much faster than diffusion-based methods. In experiments, CARP matches or outperforms existing models across a range of robotic tasks, achieving higher success rates with significantly faster inference.
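To make the coarse-to-fine idea concrete, here is a minimal, illustrative sketch of multi-scale residual encoding and decoding for a 1-D action trajectory. This is not the paper's actual architecture (CARP uses a learned action autoencoder and a GPT-style transformer); the scale sizes, block-averaging scheme, and function names below are all assumptions chosen to show the structure of next-scale refinement.

```python
import numpy as np

def multiscale_encode(actions, scales=(1, 2, 4, 8)):
    """Encode an action sequence as coarse-to-fine residual tokens.

    At each scale, store block averages of the residual between the
    true sequence and the reconstruction accumulated from all coarser
    scales. This loosely mirrors the role of CARP's multi-scale action
    autoencoder (the details here are illustrative, not the paper's).
    """
    T = len(actions)
    recon = np.zeros(T)
    tokens = []
    for s in scales:
        residual = actions - recon
        # Downsample the residual to s "tokens" via block averages.
        blocks = np.array_split(residual, s)
        coarse = np.array([b.mean() for b in blocks])
        tokens.append(coarse)
        # Upsample back to full length and accumulate.
        recon = recon + np.concatenate(
            [np.full(len(b), c) for b, c in zip(blocks, coarse)]
        )
    return tokens, recon

def coarse_to_fine_decode(tokens, T):
    """Rebuild the sequence by accumulating scales, coarsest first.

    In CARP, a GPT-style transformer would *predict* each scale's
    tokens conditioned on the coarser ones; here we simply replay
    stored tokens to show the refinement order.
    """
    recon = np.zeros(T)
    for coarse in tokens:
        sizes = [len(b) for b in np.array_split(np.zeros(T), len(coarse))]
        recon = recon + np.concatenate(
            [np.full(n, c) for n, c in zip(sizes, coarse)]
        )
    return recon

# A toy 1-D "action trajectory":
traj = np.sin(np.linspace(0, 3, 16))
tokens, recon = multiscale_encode(traj)
decoded = coarse_to_fine_decode(tokens, len(traj))
```

Each pass adds one scale of detail, so early scales already yield a usable (if blurry) trajectory; this is what allows a next-scale policy to spend only a handful of autoregressive steps instead of one step per timestep or many denoising iterations.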

Why it matters?

This research matters because it makes learned robot policies both faster and more capable: by predicting whole action sequences coarse-to-fine rather than step by step, CARP cuts inference time while keeping motions smooth and precise. That combination is important for real-world applications such as manufacturing, healthcare, and service robotics, where robots must react quickly and move accurately.

Abstract

In robotic visuomotor policy learning, diffusion-based models have achieved significant success in improving the accuracy of action trajectory generation compared to traditional autoregressive models. However, they suffer from inefficiency due to multiple denoising steps and limited flexibility from complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP decouples action generation into two stages: first, an action autoencoder learns multi-scale representations of the entire action sequence; then, a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. This straightforward and intuitive approach produces highly accurate and smooth actions, matching or even surpassing the performance of diffusion-based policies while maintaining efficiency on par with autoregressive policies. We conduct extensive evaluations across diverse settings, including single-task and multi-task scenarios on state-based and image-based simulation benchmarks, as well as real-world tasks. CARP achieves competitive success rates, with up to a 10% improvement, and delivers 10x faster inference compared to state-of-the-art policies, establishing a high-performance, efficient, and flexible paradigm for action generation in robotic tasks.