
Unified Video Action Model

Shuang Li, Yihuai Gao, Dorsa Sadigh, Shuran Song

2025-03-05


Summary

This paper introduces a new AI model called the Unified Video Action model (UVA), which combines video understanding and action prediction for robots. It's designed to help robots better understand their environment and make decisions more quickly and accurately.

What's the problem?

Current methods for teaching robots to understand videos and decide on actions tend to be good at one task or the other, but not both: they're either accurate but slow, or fast but imprecise. This makes it hard for robots to react both quickly and correctly in real-world situations.

What's the solution?

The researchers created UVA, which does two clever things. First, it learns to understand videos and actions together, helping the robot grasp the connection between what it sees and what it should do. Second, it separates the process of understanding videos from deciding actions, so the robot can make quick decisions without needing to analyze the whole video every time. UVA can also handle different tasks by selectively hiding parts of the input, making it very flexible.
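To make the two ideas concrete, here is a toy sketch of how a shared latent, two separate decoding heads, and input masking can select different tasks. Everything here is a hypothetical stand-in: the variable names are invented, the linear maps replace UVA's transformer encoder, and the two matrix heads replace its lightweight diffusion heads.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
D_OBS, D_ACT, D_LAT = 8, 4, 6

# Stand-in for UVA's learned joint video-action latent representation:
# both modalities project into one shared space.
W_obs = rng.standard_normal((D_OBS, D_LAT))
W_act = rng.standard_normal((D_ACT, D_LAT))

# Stand-ins for the two decoupled decoding heads (diffusion heads in UVA):
W_video = rng.standard_normal((D_LAT, D_OBS))   # latent -> future observation
W_action = rng.standard_normal((D_LAT, D_ACT))  # latent -> action

def joint_latent(obs, act, mask_obs=False, mask_act=False):
    """Fuse observation and action into one latent. Masking an input
    (zeroing it here) is how a single model is steered toward different
    tasks, analogous to UVA's masked-input training."""
    o = np.zeros(D_OBS) if mask_obs else obs
    a = np.zeros(D_ACT) if mask_act else act
    return o @ W_obs + a @ W_act

obs = rng.standard_normal(D_OBS)
act = rng.standard_normal(D_ACT)

# Policy learning: mask the action input and decode only the action head.
# Because decoding is decoupled, the video head is bypassed entirely,
# which is what makes action inference fast.
z = joint_latent(obs, act, mask_act=True)
next_action = z @ W_action

# Forward dynamics / video prediction: same latent machinery, but decode
# the video head instead.
z2 = joint_latent(obs, act)
next_obs = z2 @ W_video

print(next_action.shape, next_obs.shape)  # (4,) (8,)
```

The point of the sketch is the routing, not the math: one shared latent models the video-action relationship, while the choice of mask and head determines whether the model acts as a policy, a dynamics model, or a video predictor.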

Why it matters?

This matters because it could make robots much smarter and more adaptable. With UVA, robots could potentially handle a wide range of tasks more efficiently, from predicting what will happen next to figuring out what actions to take in new situations. This could lead to more capable and versatile robots in various fields, from manufacturing to healthcare, without needing separate specialized systems for each task.

Abstract

A unified video and action model holds significant promise for robotics, where videos provide rich scene information for action prediction, and actions provide dynamics information for video prediction. However, effectively combining video generation and action prediction remains challenging, and current video generation-based methods struggle to match the performance of direct policy learning in action accuracy and inference speed. To bridge this gap, we introduce the Unified Video Action model (UVA), which jointly optimizes video and action predictions to achieve both high accuracy and efficient action inference. The key lies in learning a joint video-action latent representation and decoupling video-action decoding. The joint latent representation bridges the visual and action domains, effectively modeling the relationship between video and action sequences. Meanwhile, the decoupled decoding, powered by two lightweight diffusion heads, enables high-speed action inference by bypassing video generation during inference. Such a unified framework further enables versatile functionality through masked input training. By selectively masking actions or videos, a single model can tackle diverse tasks beyond policy learning, such as forward and inverse dynamics modeling and video generation. Via an extensive set of experiments, we demonstrate that UVA can serve as a general-purpose solution for a wide range of robotics tasks, such as policy learning, forward/inverse dynamics and video observation prediction, without compromising performance compared to methods tailored for specific applications. Results are best viewed on https://unified-video-action-model.github.io/.