Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
Ran Tian, Yilin Wu, Chenfeng Xu, Masayoshi Tomizuka, Jitendra Malik, Andrea Bajcsy
2024-12-11

Summary
This paper introduces Representation-Aligned Preference-based Learning (RAPL), a method that helps robots learn to perform tasks in line with human preferences while requiring far less human feedback than standard approaches.
What's the problem?
Robots that learn visuomotor tasks, like picking up objects or navigating spaces, need to behave in the way their end-users want. However, standard approaches to learning those preferences, such as reinforcement learning from human feedback (RLHF), require large amounts of human feedback to learn a visual reward, which is time-consuming and often impractical. This makes it hard to align robot behavior with human preferences, especially when those preferences are hard to specify explicitly.
What's the solution?
The authors propose RAPL, which concentrates the limited human preference feedback on fine-tuning the robot's pre-trained vision encoder so that its visual representation reflects how the end-user judges behavior, rather than spending that feedback on labeling every new behavior directly. Once the representation is aligned, RAPL constructs a dense visual reward by matching features in this aligned space, reusing the robot's existing observation data. The method was tested in simulation and on real robot hardware, where it improved policy behavior while needing roughly five times less human preference data than traditional methods.
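To make the first step concrete, here is a minimal, hypothetical PyTorch sketch of fine-tuning a pre-trained vision encoder from a handful of pairwise human preferences so that its feature space reflects what the user considers good behavior. The names (TinyEncoder, trajectory_score, preference_alignment_loss) and the goal-image scoring rule are illustrative assumptions, not the paper's actual architecture or objective.

```python
# Hedged sketch: spend a few human preference labels on aligning the visual
# representation, rather than on learning a reward head for every behavior.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a pre-trained visual encoder being fine-tuned (illustrative)."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W) -> per-frame features (T, feat_dim)
        return self.net(frames)


def trajectory_score(encoder: nn.Module, frames: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
    """Toy score: how close the video's features stay to a goal image's features."""
    feats = encoder(frames)                  # (T, D)
    goal_feat = encoder(goal).mean(dim=0)    # (D,)
    return -torch.norm(feats - goal_feat, dim=-1).mean()


def preference_alignment_loss(encoder, preferred, rejected, goal):
    """Bradley-Terry style loss: the human-preferred video should score higher."""
    s_pos = trajectory_score(encoder, preferred, goal)
    s_neg = trajectory_score(encoder, rejected, goal)
    return -F.logsigmoid(s_pos - s_neg)


if __name__ == "__main__":
    enc = TinyEncoder()
    opt = torch.optim.Adam(enc.parameters(), lr=1e-4)
    # Dummy preference pair: two 8-frame videos plus a goal image (random tensors here).
    preferred = torch.rand(8, 3, 64, 64)
    rejected = torch.rand(8, 3, 64, 64)
    goal = torch.rand(1, 3, 64, 64)
    loss = preference_alignment_loss(enc, preferred, rejected, goal)
    loss.backward()
    opt.step()
    print(f"preference loss: {loss.item():.4f}")
```

Because the feedback shapes the representation itself rather than labeling each new behavior, far fewer preference queries are needed, which is the efficiency gain the paper reports.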
Why it matters?
This research is important because it makes it easier and more efficient to train robots to understand and follow human preferences. By reducing the amount of feedback needed, RAPL can help accelerate the development of smarter robots that can perform complex tasks in various environments, making them more useful in everyday applications like home assistance, manufacturing, and healthcare.
Abstract
Visuomotor robot policies, increasingly pre-trained on large-scale datasets, promise significant advancements across robotics domains. However, aligning these policies with end-user preferences remains a challenge, particularly when the preferences are hard to specify. While reinforcement learning from human feedback (RLHF) has become the predominant mechanism for alignment in non-embodied domains like large language models, it has not seen the same success in aligning visuomotor policies due to the prohibitive amount of human feedback required to learn visual reward functions. To address this limitation, we propose Representation-Aligned Preference-based Learning (RAPL), an observation-only method for learning visual rewards from significantly less human preference feedback. Unlike traditional RLHF, RAPL focuses human feedback on fine-tuning pre-trained vision encoders to align with the end-user's visual representation and then constructs a dense visual reward via feature matching in this aligned representation space. We first validate RAPL through simulation experiments in the X-Magical benchmark and Franka Panda robotic manipulation, demonstrating that it learns rewards aligned with human preferences, uses preference data more efficiently, and generalizes across robot embodiments. Finally, our hardware experiments align pre-trained Diffusion Policies for three object manipulation tasks. We find that RAPL can fine-tune these policies with 5x less real human preference data, taking the first step towards minimizing human feedback while maximizing visuomotor robot policy alignment.
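As a rough illustration of the second step described in the abstract, constructing a dense visual reward via feature matching in the aligned representation space, the hedged sketch below rewards each rollout frame by its proximity (in feature space) to frames from a user-preferred reference video. The nearest-neighbor matching rule and the function name feature_matching_reward are simplifying assumptions; the paper's actual matching procedure may differ.

```python
# Hedged sketch: once the encoder is aligned, a dense per-timestep reward can be
# computed by matching rollout features against features of preferred behavior.
import torch


def feature_matching_reward(
    rollout_feats: torch.Tensor,    # (T, D) aligned features of the policy's rollout frames
    reference_feats: torch.Tensor,  # (K, D) aligned features of a preferred reference video
) -> torch.Tensor:
    """Per-timestep reward: closer to the reference behavior in feature space is better."""
    dists = torch.cdist(rollout_feats, reference_feats)  # (T, K) pairwise distances
    nearest = dists.min(dim=1).values                    # (T,) distance to closest reference frame
    return -nearest                                      # dense reward, one value per rollout frame


if __name__ == "__main__":
    torch.manual_seed(0)
    rollout = torch.randn(10, 64)    # e.g., 10 rollout frames with 64-dim aligned features
    reference = torch.randn(25, 64)  # e.g., 25 frames from a preferred demonstration
    rewards = feature_matching_reward(rollout, reference)
    print(rewards.shape, rewards.mean().item())
```

Such a reward is dense (one value per frame) and observation-only, which is what lets it drive policy fine-tuning without asking the human to label each new rollout.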