
Diffusion Policy Policy Optimization

Allen Z. Ren, Justin Lidard, Lars L. Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, Max Simchowitz

2024-09-04


Summary

This paper introduces Diffusion Policy Policy Optimization (DPPO), a new method for improving how robots learn to perform tasks by fine-tuning diffusion-based policies with reinforcement learning.

What's the problem?

Diffusion-based policies are a popular way for robots to represent complex behaviors, but improving them further with reinforcement learning is hard: policy gradient methods, the standard tool for this kind of fine-tuning, were thought to be inefficient when applied to diffusion models. Without effective fine-tuning, robots trained this way struggle to perform well across different environments and real-world conditions.

What's the solution?

DPPO introduces a framework that combines best practices for fine-tuning diffusion-based policies in continuous control tasks. It uses policy gradient (PG) methods from reinforcement learning to optimize the policy based on the rewards the robot collects. Because RL fine-tuning and the diffusion parameterization work well together, the policy explores its environment in a structured way, training stays stable, and the resulting behavior is robust. DPPO has been tested across a range of robotic benchmarks, where it outperforms other RL methods for diffusion-based policies as well as PG fine-tuning of other policy types. A simplified sketch of the core idea is shown below.
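The key observation is that each Gaussian denoising step of a diffusion policy has a tractable log-probability, so the whole denoising chain can be optimized with a standard policy gradient objective such as PPO's clipped surrogate. The sketch below illustrates this idea in PyTorch; it is a simplified illustration under stated assumptions, not the authors' code. The network eps_model, the dimensions, and the noise schedule are made-up placeholders (the real implementation is linked at diffusion-ppo.github.io).

# A minimal, simplified sketch (not the authors' implementation): fine-tuning a
# diffusion policy with a PPO-style clipped objective by treating each Gaussian
# denoising step as a stochastic decision with a differentiable log-probability.
# `eps_model`, the dimensions, and the noise schedule are illustrative assumptions.
import torch
import torch.nn as nn

K = 10                                   # number of denoising steps
obs_dim, act_dim = 8, 2                  # illustrative dimensions
betas = torch.linspace(1e-4, 0.1, K)     # toy DDPM noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

# Hypothetical denoising network: predicts the noise in a noisy action,
# conditioned on the observation and the denoising step index.
eps_model = nn.Sequential(
    nn.Linear(obs_dim + act_dim + 1, 128), nn.ReLU(),
    nn.Linear(128, act_dim),
)

def denoise_chain(obs):
    """Run the reverse diffusion chain and record each step's Gaussian
    log-probability, so the chain can be treated as a short sequence of
    stochastic policy decisions."""
    a = torch.randn(obs.shape[0], act_dim)            # start from pure noise
    log_probs = []
    for k in reversed(range(K)):
        k_in = torch.full((obs.shape[0], 1), float(k))
        eps = eps_model(torch.cat([obs, a, k_in], dim=-1))
        # Standard DDPM reverse-step mean; each step is a Gaussian "action".
        mean = (a - betas[k] / torch.sqrt(1.0 - alpha_bars[k]) * eps) / torch.sqrt(alphas[k])
        std = torch.sqrt(betas[k]) * torch.ones_like(mean)
        dist = torch.distributions.Normal(mean, std)
        a = dist.sample()
        log_probs.append(dist.log_prob(a).sum(dim=-1))
    return a, torch.stack(log_probs, dim=-1)          # final action, (batch, K) log-probs

def ppo_clip_loss(new_logp, old_logp, advantage, clip=0.2):
    """PPO clipped surrogate applied to the per-denoising-step log-probs,
    with the environment advantage broadcast across denoising steps."""
    ratio = torch.exp(new_logp - old_logp)
    adv = advantage.unsqueeze(-1)
    return -torch.min(ratio * adv, torch.clamp(ratio, 1 - clip, 1 + clip) * adv).mean()

# Usage sketch (collection phase): sample an action and store its log-probs.
# During updates one would re-evaluate the stored denoising chain under the
# updated network to get `new_logp`, then minimize ppo_clip_loss.
obs = torch.randn(4, obs_dim)
action, old_logp = denoise_chain(obs)
loss = ppo_clip_loss(old_logp, old_logp.detach(), advantage=torch.randn(4))
print(action.shape, loss.item())

In words: the environment reward drives an advantage estimate as in ordinary PPO, and that advantage is credited to every Gaussian denoising step that produced the executed action, so the policy gradient flows through the whole denoising chain.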

Why it matters?

This research is important because it enhances the ability of robots to learn and adapt to new tasks efficiently. By improving the way robots are trained, DPPO can lead to better performance in real-world applications, such as manufacturing, healthcare, and service industries, ultimately making robots more useful and effective.

Abstract

We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task. Website with code: diffusion-ppo.github.io