Affordance-based Robot Manipulation with Flow Matching

Fan Zhang, Michael Gienger

2024-09-05

Summary

This paper introduces Affordance-based Robot Manipulation with Flow Matching, a framework that helps robots interact with their environment more effectively by understanding what objects can be used for (their affordances).

What's the problem?

Robots often struggle to adapt to different tasks in everyday life because gathering data on how humans interact with objects is difficult and time-consuming. Additionally, teaching robots how to move and manipulate objects based on visual cues can be challenging, especially when they need to learn from complex environments.

What's the solution?

To solve these problems, the authors use a parameter-efficient prompt tuning technique that adapts a frozen vision model to predict manipulation affordances across tasks, instead of retraining the whole model. They then propose a Flow Matching approach that generates robot trajectories guided by those affordances: trajectory generation is treated as a process of flowing random waypoints toward desired robot trajectories. The framework is trained and evaluated on a new dataset covering 10 tasks drawn from Activities of Daily Living, improving the robot's ability to understand scenes involving human interaction and act on them.
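The "flow of waypoints" idea can be made concrete with a small numeric sketch. This is an illustrative toy, not the authors' code: all names and shapes are assumptions, and the ground-truth conditional velocity stands in for the neural velocity field the paper would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy flow matching sketch. A robot trajectory is a sequence of 2-D waypoints;
# random waypoints are "flowed" toward a demonstration trajectory along
# straight-line interpolation paths.

T = 8                      # waypoints per trajectory (assumed)
demo = np.stack([np.linspace(0, 1, T), np.sin(np.linspace(0, 3, T))], axis=1)

def conditional_velocity(x_t, x0, x1):
    # For the linear path x_t = (1 - t) * x0 + t * x1,
    # the target (conditional) velocity is constant: x1 - x0.
    return x1 - x0

# Training signal: sample random waypoints x0 and a time t, interpolate,
# and regress a velocity network v_theta(x_t, t) onto the target x1 - x0.
x0 = rng.normal(size=demo.shape)       # random initial waypoints
t = rng.uniform()
x_t = (1 - t) * x0 + t * demo
target = conditional_velocity(x_t, x0, demo)

# Inference: Euler-integrate the velocity field from random waypoints to a
# trajectory. Here the true conditional velocity stands in for v_theta.
steps = 20
x = x0.copy()
for _ in range(steps):
    x = x + (1.0 / steps) * conditional_velocity(x, x0, demo)

print(np.allclose(x, demo, atol=1e-6))  # flowing recovers the demo trajectory
```

With a learned velocity field conditioned on the affordance map, the same integration loop would produce task-appropriate trajectories rather than replaying one demonstration.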

Why it matters?

This research is important because it advances how robots can assist people in daily life by making them more adaptable and efficient in understanding their surroundings. By improving robot manipulation skills, this technology could lead to better assistance in homes, workplaces, and other environments where robots can help with everyday tasks.

Abstract

We present a framework for assistive robot manipulation, which focuses on two fundamental challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort; second, effectively learning robot trajectories by grounding the visual affordance model. We tackle the first challenge by employing a parameter-efficient prompt tuning method that prepends learnable text prompts to the frozen vision model to predict manipulation affordances in multi-task scenarios. Then we propose to learn robot trajectories guided by affordances in a supervised Flow Matching method. Flow matching represents a robot visuomotor policy as a conditional process of flowing random waypoints to desired robot trajectories. Finally, we introduce a real-world dataset with 10 tasks across Activities of Daily Living to test our framework. Our extensive evaluation highlights that the proposed prompt tuning method for learning manipulation affordance with language prompter achieves competitive performance and even outperforms other finetuning protocols across data scales, while satisfying parameter efficiency. Learning multi-task robot trajectories with a single flow matching policy also leads to consistently better performance than alternative behavior cloning methods, especially given multimodal robot action distributions. Our framework seamlessly unifies affordance model learning and trajectory generation with flow matching for robot manipulation.
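The parameter-efficient prompt tuning described above can be sketched in a few lines. This is a minimal illustration under assumed shapes, not the authors' implementation: a frozen encoder processes a token sequence, and the only trainable parameters are a handful of prompt vectors prepended to the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prompt-tuning sketch: the backbone stays frozen; only prepended prompt
# tokens (plus a small task head) would receive gradient updates.

d, n_patches, n_prompts = 16, 4, 2      # assumed dimensions

W_frozen = rng.normal(size=(d, d))      # stands in for the frozen vision model
def frozen_encoder(tokens):
    # Frozen: no updates to W_frozen during training.
    return np.tanh(tokens @ W_frozen)

prompts = rng.normal(size=(n_prompts, d)) * 0.01   # learnable prompt tokens
image_tokens = rng.normal(size=(n_patches, d))     # patch embeddings

# Forward pass: prepend prompts, run the frozen backbone, pool for a prediction.
x = np.concatenate([prompts, image_tokens], axis=0)
features = frozen_encoder(x)
affordance_logit = features.mean(axis=0) @ np.ones(d)   # toy affordance head

# Parameter efficiency: 32 trainable prompt values vs 256 frozen weights here;
# the gap is far larger for a real vision transformer.
print(prompts.size, W_frozen.size)
```

Because only the prompts are updated per task, multi-task adaptation adds a small number of parameters per task while the shared backbone is reused everywhere, which is the parameter-efficiency claim the abstract makes.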