
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov

2024-12-19


Summary

This paper introduces Mixture-of-Denoising Experts (MoDE), a new policy architecture for teaching robots tasks through imitation learning. MoDE makes the learning process more efficient by using sparse expert networks inside a diffusion transformer, reducing the computing power needed while achieving better results than prior diffusion policies.

What's the problem?

As robot learning models grow larger to capture more complex behaviors, their computational demands increase sharply, making them slower, more expensive, and less practical to train and deploy. Existing diffusion policy architectures do not scale efficiently and often struggle to perform consistently across many different tasks.

What's the solution?

The authors propose MoDE, which replaces the dense feed-forward layers of a diffusion transformer with a mixture of expert denoisers and routes inputs to experts based on the current noise level of the denoising process. This design reduces active parameters by about 40% and, by caching the noise-conditioned routing decisions, cuts inference costs by about 90%. After pretraining on diverse robotics data, MoDE outperforms both CNN-based and transformer diffusion policies by an average of 57% across four benchmarks.
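The core idea can be illustrated with a small, hypothetical PyTorch module: each transformer block holds several feed-forward "expert denoisers", and a router picks which experts to run based only on an embedding of the current noise level, not on the token content. The class name, layer sizes, and top-k routing below are illustrative assumptions made for this sketch, not the paper's released implementation.

```python
# Minimal sketch of a noise-conditioned mixture-of-experts layer in the spirit
# of MoDE. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoiseConditionedMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The router sees only the noise-level embedding, so the expert choice
        # depends on the denoising step rather than on the tokens themselves.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x, noise_emb):
        # x: (batch, seq, d_model), noise_emb: (batch, d_model)
        logits = self.router(noise_emb)                      # (batch, num_experts)
        weights = F.softmax(logits, dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # keep only top-k experts
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize their weights

        out = torch.zeros_like(x)
        for b in range(x.size(0)):
            for w, idx in zip(topk_w[b], topk_idx[b]):
                out[b] = out[b] + w * self.experts[int(idx)](x[b])
        return out
```

Because the router ignores token content, every input processed at a given noise level uses the same small set of experts, which is what makes the sparsity (and the caching described in the abstract) pay off at inference time.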

Why it matters?

This research is important because it not only improves how robots learn but also makes it more feasible to apply these advanced learning techniques in real-world scenarios. The efficiency gains mean that more complex tasks can be tackled with less computational power, opening up new possibilities for robotics applications in industries like manufacturing, healthcare, and automation.

Abstract

Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models are becoming larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Therefore, continuing with the current architectures will present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing both active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.
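The 90% inference-cost reduction mentioned in the abstract comes from expert caching: because routing is conditioned only on the noise level, the top-k expert choices for every denoising step can be precomputed once, and at inference the dense router and the never-selected experts are simply skipped. The helpers below are a hypothetical illustration built on the NoiseConditionedMoE sketch above, not the authors' released code.

```python
# Illustrative sketch of expert caching for noise-conditioned routing.
# Assumes the NoiseConditionedMoE class sketched earlier.
import torch


@torch.no_grad()
def precompute_routing(moe, noise_embeddings):
    # noise_embeddings: (num_noise_levels, d_model), one embedding per denoising step.
    logits = moe.router(noise_embeddings)               # (levels, num_experts)
    weights = torch.softmax(logits, dim=-1)
    topk_w, topk_idx = weights.topk(moe.top_k, dim=-1)
    topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)
    # Lookup tables indexed by noise level; computed once, reused at inference.
    return topk_w, topk_idx


def cached_forward(moe, x, level, topk_w, topk_idx):
    # Run only the experts cached for this noise level; the router never runs here.
    out = torch.zeros_like(x)
    for w, idx in zip(topk_w[level], topk_idx[level]):
        out = out + w * moe.experts[int(idx)](x)
    return out
```

In this sketch the routing tables are tiny (one row per denoising step), so the per-step cost at inference is dominated by the few selected experts rather than by the full dense model.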