Action Images: End-to-End Policy Learning via Multiview Video Generation

Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan

2026-04-08

Summary

This paper introduces a new way for robots to learn tasks: instead of treating actions as abstract numbers, it represents them as short videos, so the robot's understanding of its own actions stays visual and tied directly to what it actually *sees*.

What's the problem?

Current methods for teaching robots with video models usually treat actions as abstract codes or low-dimensional numbers, kept separate from the visual information in the video. This makes it hard for the robot to transfer what it learns from one situation to another, especially when the camera viewpoint or the environment changes. In effect, the model never fully exploits the knowledge its video backbone picked up during pretraining.

What's the solution?

The researchers developed 'Action Images,' which translate 7-DoF robot arm movements into short multi-view videos that trace the arm's motion in 2D pixels. Imagine showing the robot its arm reaching for an object from several camera angles: instead of telling it 'move the arm to position X,' the action is *shown* visually. Because actions live in the same pixel space as the observations, the pretrained video backbone itself can act as a zero-shot policy, predicting actions without a separate policy head or action module. The same unified model also supports related tasks such as action-conditioned video generation, joint video-action generation, and labeling actions in existing videos.
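To make the idea concrete, here is a minimal Python sketch of how a 7-DoF action trajectory could be rendered into per-view pixel tracks. This is not the authors' implementation: the camera parameters, image size, and drawing scheme (a brightness ramp encoding time order) are illustrative assumptions only.

```python
# A minimal sketch (not the authors' code) of the core idea: take a 7-DoF
# action trajectory (x, y, z, roll, pitch, yaw, gripper) and render the
# end-effector's motion as 2D pixel tracks in several camera views.
import numpy as np

def project_point(p_world, K, T_world_to_cam):
    """Project a 3D world point into pixel coordinates for one camera."""
    p_h = np.append(p_world, 1.0)                 # homogeneous coordinates
    p_cam = (T_world_to_cam @ p_h)[:3]            # world frame -> camera frame
    u, v, w = K @ p_cam                           # pinhole projection
    return np.array([u / w, v / w])

def render_action_image(actions, cameras, hw=(256, 256)):
    """Draw the end-effector trajectory of a 7-DoF action sequence into
    one blank 'action image' per camera view.

    actions: (T, 7) array; columns 0..2 are end-effector xyz in the world
             frame, 3..5 orientation, 6 gripper open/close (unused here).
    cameras: list of (K, T_world_to_cam) pairs, one per viewpoint.
    """
    h, w = hw
    views = []
    for K, T_wc in cameras:
        canvas = np.zeros((h, w), dtype=np.float32)
        for t, a in enumerate(actions):
            u, v = project_point(a[:3], K, T_wc)
            ui, vi = int(round(u)), int(round(v))
            if 0 <= vi < h and 0 <= ui < w:
                # brighten later timesteps so the track encodes time order
                canvas[vi, ui] = (t + 1) / len(actions)
        views.append(canvas)
    return np.stack(views)                        # (num_views, H, W)
```

In the paper's framing, tracks like these are what the video backbone learns to generate and consume, rather than a separate stream of low-dimensional action tokens.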

Why it matters?

This work is important because it allows robots to learn more effectively from visual data, leading to better performance and the ability to adapt to new situations. By grounding actions in pixels, the robot can leverage powerful pretrained video models and potentially learn new skills from less task-specific training data, making robots more versatile and easier to deploy in the real world.

Abstract

World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
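For the video backbone to act as a zero-shot policy, the generated action images must also be read back out as executable motion. The sketch below shows one standard way this could be done, triangulating the tracked end-effector pixel across calibrated views; this is a hedged illustration using textbook DLT triangulation, not the paper's exact decoding procedure.

```python
# A hedged sketch of the decoding side: given the end-effector's pixel
# location in two or more calibrated views (as read off a generated action
# image), recover a 3D waypoint by linear (DLT-style) triangulation.
import numpy as np

def triangulate(pixels, projections):
    """Recover a 3D point from its pixel location in >= 2 calibrated views.

    pixels:      list of (u, v) observations, one per view.
    projections: list of 3x4 camera projection matrices P = K [R | t].
    """
    rows = []
    for (u, v), P in zip(pixels, projections):
        rows.append(u * P[2] - P[0])   # each view contributes two
        rows.append(v * P[2] - P[1])   # linear constraints on X
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)        # least-squares solution via SVD
    X = vt[-1]
    return X[:3] / X[3]                # dehomogenize to (x, y, z)
```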