DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control
Zichen Jeff Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, Lerrel Pinto
2024-09-25

Summary
This paper introduces DynaMo, a method that improves how robots learn visuo-motor control through a technique called in-domain dynamics pretraining. Instead of relying on large external datasets, DynaMo pretrains visual representations directly on the robot's own demonstration data, helping policies understand how the environment responds to interaction while requiring far fewer expert demonstrations.
What's the problem?
Training robots to perform tasks usually requires hundreds to thousands of expert demonstrations, which are expensive and time-consuming to collect. A key reason for this poor data efficiency is how visual representations are learned: they are typically either pretrained on out-of-domain data (such as Internet images) that may not reflect the robot's actual tasks, or trained directly through a behavior cloning objective, which makes it hard to generalize to new situations or objects.
What's the solution?
DynaMo addresses this problem with a self-supervised pretraining method that learns visual representations directly from the expert demonstrations themselves. It jointly trains a latent inverse dynamics model and a forward dynamics model over sequences of image embeddings, learning to predict the next frame in latent space without ground-truth actions, data augmentations, contrastive sampling, or any outside data sources (see the sketch below). Because the representations capture the dynamics of the specific domain, downstream policies learn more efficiently and perform better on various tasks with less data.
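To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of this kind of pretraining objective: an image encoder, a latent inverse dynamics model, and a forward dynamics model are trained jointly to predict the next frame in embedding space. The module architectures, the MSE loss, and the stop-gradient on the target are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of a DynaMo-style pretraining step: all module names, sizes,
# the loss choice, and the stop-gradient on the target are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps an image observation to a latent embedding (hypothetical CNN)."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

class InverseDynamics(nn.Module):
    """Infers a latent 'action' from two consecutive embeddings
    (no ground-truth actions are used)."""
    def __init__(self, emb_dim=256, act_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, z_t, z_next):
        return self.net(torch.cat([z_t, z_next], dim=-1))

class ForwardDynamics(nn.Module):
    """Predicts the next embedding from the current embedding and latent action."""
    def __init__(self, emb_dim=256, act_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, z_t, a_latent):
        return self.net(torch.cat([z_t, a_latent], dim=-1))

def pretraining_step(encoder, inv_dyn, fwd_dyn, obs_t, obs_next, optimizer):
    """One self-supervised step: predict the next frame in latent space."""
    z_t, z_next = encoder(obs_t), encoder(obs_next)
    a_latent = inv_dyn(z_t, z_next)        # latent action between frames
    z_next_pred = fwd_dyn(z_t, a_latent)   # forward prediction in latent space
    # Prediction loss on embeddings; detaching the target to avoid collapse
    # is an assumption, not necessarily the authors' mechanism.
    loss = F.mse_loss(z_next_pred, z_next.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    encoder, inv_dyn, fwd_dyn = Encoder(), InverseDynamics(), ForwardDynamics()
    params = list(encoder.parameters()) + list(inv_dyn.parameters()) + list(fwd_dyn.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    # Dummy consecutive frames standing in for expert demonstration video.
    obs_t, obs_next = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
    print(pretraining_step(encoder, inv_dyn, fwd_dyn, obs_t, obs_next, optimizer))
```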
Why it matters?
This research is significant because it enhances the ability of robots to learn and adapt to new tasks quickly, making them more versatile and capable in real-world applications. By reducing the need for extensive training data, DynaMo can help accelerate the development of robotic systems that can operate in diverse environments, which is crucial for advancements in fields like automation, manufacturing, and service robotics.
Abstract
Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure their impact on downstream policy performance. Robot videos are best viewed at https://dynamo-ssl.github.io
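As one concrete illustration of downstream use, the hedged sketch below implements the simplest policy class mentioned in the abstract, nearest neighbors: demonstration frames are embedded with a pretrained encoder, and at test time the action of the closest demonstration frame is returned. The stand-in encoder, the Euclidean distance metric, and the action dimensionality are assumptions for illustration only.

```python
# Hedged sketch: nearest-neighbor imitation on top of a pretrained encoder.
import torch
import torch.nn as nn

@torch.no_grad()
def nearest_neighbor_policy(encoder, demo_obs, demo_actions, obs):
    """Return the action of the demonstration frame whose embedding is
    closest (Euclidean distance) to the current observation's embedding."""
    z_demo = encoder(demo_obs)           # (N, D) demonstration embeddings
    z = encoder(obs.unsqueeze(0))        # (1, D) current embedding
    idx = torch.cdist(z, z_demo).argmin(dim=-1).item()
    return demo_actions[idx]

if __name__ == "__main__":
    # Stand-in for a pretrained DynaMo-style visual encoder (assumption).
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256)).eval()
    demo_obs = torch.randn(100, 3, 64, 64)   # demonstration frames
    demo_actions = torch.randn(100, 7)       # corresponding robot actions
    obs = torch.randn(3, 64, 64)             # current observation
    print(nearest_neighbor_policy(encoder, demo_obs, demo_actions, obs))
```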