
Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression

Lirui Wang, Kevin Zhao, Chaoqi Liu, Xinlei Chen

2025-02-07


Summary

This paper introduces a new method called Heterogeneous Masked Autoregression (HMA) for generating realistic videos of robot actions. HMA helps robots learn from videos more efficiently and can produce high-quality simulations for training and testing robot behaviors.

What's the problem?

Teaching robots to understand and interact with the real world is challenging because there are so many different situations and environments to consider. Current methods for creating video simulations of robot actions are either not realistic enough or too slow to be useful for real-time robot learning.

What's the solution?

The researchers developed HMA, which uses a technique called masked autoregression to predict and generate videos of robot actions. Because HMA is pre-trained on data from many different robot types, domains, and tasks, it is more versatile than models trained on a single setup. It can produce both discrete (quantized) and continuous (soft) video predictions, and it does this much faster than previous methods.
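The core idea of masked autoregression is to start from a fully masked token sequence and fill it in over a few steps, committing the most confident predictions first. The toy decoder below is a minimal sketch of that loop, assuming a generic `probs_fn` that returns per-token probabilities; it is illustrative only, not the paper's actual model or API:

```python
import numpy as np

def masked_autoregressive_decode(probs_fn, seq_len, steps=4):
    """Toy masked-autoregressive decoding loop (illustrative sketch).

    Starts with every token masked, then over `steps` iterations:
    predict all positions, and commit the most confident still-masked
    ones. Committing argmax IDs corresponds to "quantized" tokens;
    keeping the probability vectors instead would be "soft" tokens.
    """
    MASK = -1
    tokens = np.full(seq_len, MASK)
    per_step = int(np.ceil(seq_len / steps))
    for _ in range(steps):
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        probs = probs_fn(tokens)          # shape (seq_len, vocab_size)
        conf = probs.max(axis=1)          # confidence per position
        pred = probs.argmax(axis=1)       # quantized (discrete) prediction
        # commit the most confident masked positions this step
        commit = masked[np.argsort(-conf[masked])][:per_step]
        tokens[commit] = pred[commit]
    return tokens
```

Filling several tokens per step, rather than one at a time, is what makes this style of decoding faster than strictly sequential autoregression.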

Why it matters?

This research matters because it could speed up and improve how robots learn to interact with the world. By creating more realistic and faster video simulations, HMA allows researchers to test and refine robot behaviors without risking damage to real robots or their surroundings. This could lead to smarter, more adaptable robots that can be used in more complex real-world situations.

Abstract

We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics to generate high-quality data and evaluation in scaling robot learning. Building interactive video world models and policies for robotics is difficult due to the challenge of handling diverse settings while maintaining computational efficiency to run in real time. HMA uses heterogeneous pre-training from observations and action sequences across different robotic embodiments, domains, and tasks. HMA uses masked autoregression to generate quantized or soft tokens for video predictions. HMA achieves better visual fidelity and controllability than previous robotic video generation models, with 15 times faster speed in the real world. After post-training, this model can be used as a video simulator from low-level action inputs for evaluating policies and generating synthetic data. See https://liruiw.github.io/hma for more information.