StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

Mingyu Liu, Jiuhe Shu, Hui Chen, Zeju Li, Canyu Zhao, Jiange Yang, Shenyuan Gao, Hao Chen, Chunhua Shen

2025-10-09

Summary

This paper introduces a new way for robots to understand and represent their surroundings, allowing them to make decisions more efficiently. It focuses on creating a simplified 'mental picture' of a scene that still contains all the important information for a robot to act.

What's the problem?

Currently, robots struggle to create good 'mental pictures' of the world. Existing methods either create pictures with too much unnecessary detail, slowing down processing, or they leave out crucial information needed to complete tasks. Essentially, it's hard to find the right balance between simplicity and usefulness when representing a robot's environment.

What's the solution?

The researchers developed a method called StaMo that uses a two-part system. First, it compresses an image of a scene into a very short code, just two tokens, using a lightweight 'encoder'. Then, it uses a powerful pre-trained AI model, called a Diffusion Transformer, to 'decode' that code back into a meaningful representation. Surprisingly, the difference between the codes of two states naturally suggests what action the robot should take, and this can be translated into actual robot movements. All of this happens without labeled examples of which actions are good; the robot learns it on its own.
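The pipeline described above, encode each image into two compact tokens, take the difference between the tokens of consecutive states as a latent action, then map that latent action to motor commands, can be sketched as follows. This is a minimal illustration with random linear maps standing in for the trained encoder and action decoder; the dimensions (`IMG_DIM`, `TOKEN_DIM`, `ACTION_DIM`) and all function names are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a flattened image, two compact state tokens,
# and a 7-dimensional robot action (assumed, e.g. a 7-DoF arm command).
IMG_DIM, TOKEN_DIM, ACTION_DIM = 64, 8, 7

# Stand-in for StaMo's lightweight encoder: a fixed linear projection
# from a flattened image to a two-token state representation.
W_enc = rng.normal(scale=0.1, size=(2 * TOKEN_DIM, IMG_DIM))

def encode(image: np.ndarray) -> np.ndarray:
    """Compress an image into two state tokens of TOKEN_DIM each."""
    return (W_enc @ image).reshape(2, TOKEN_DIM)

def latent_action(state_t: np.ndarray, state_t1: np.ndarray) -> np.ndarray:
    """The paper's key observation: the difference between the compact
    representations of two states behaves as a latent action. A plain
    difference is used here as a stand-in for latent interpolation."""
    return (state_t1 - state_t).flatten()

# Hypothetical small decoder mapping a latent action to motor commands.
W_dec = rng.normal(scale=0.1, size=(ACTION_DIM, 2 * TOKEN_DIM))

def decode_action(z: np.ndarray) -> np.ndarray:
    """Translate a latent action into an executable robot action."""
    return W_dec @ z

img_t = rng.normal(size=IMG_DIM)   # observation at time t
img_t1 = rng.normal(size=IMG_DIM)  # observation at time t+1

z = latent_action(encode(img_t), encode(img_t1))
action = decode_action(z)
print(action.shape)  # (7,)
```

In the paper the decoder side is a pre-trained Diffusion Transformer with a strong generative prior, which is what lets such a tiny two-token code remain expressive; the linear maps above only show how the pieces connect.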

Why does it matter?

This work is important because it shows robots can learn to understand their environment and plan actions using a much simpler system than previously thought. It improves performance on robot tasks, makes the robot's decision-making process easier to understand, and works well with different types of data like real-world footage, simulations, and even videos of people. It also reduces the reliance on complex AI structures and large amounts of video data for learning how to move.

Abstract

A fundamental challenge in embodied intelligence is developing expressive and compact state representations for efficient world modeling and decision making. However, existing methods often fail to achieve this balance, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on its strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 14.3% on LIBERO and 30% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation, which is encoded from static images, challenging the prevalent dependence on complex architectures and video data for learning latent actions. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.