ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu
2026-02-16
Summary
This paper introduces ABot-M0, a new system designed to create robots that can perform a variety of tasks using different bodies: essentially, giving robots a single 'brain' that can control many different 'forms'. It focuses on making these robots easier to train by improving how data is collected, organized, and used for learning.
What's the problem?
Currently, building robots that can adapt to different situations and hardware is difficult because the data used to train them is often messy, inconsistent, and doesn't align well with what the robot needs to learn. Different robots produce data in different formats, making it hard to combine information and teach a robot to generalize its skills. It's like trying to learn from textbooks written in multiple languages with no translation.
What's the solution?
The researchers created a system called ABot-M0 that tackles this problem in a few key ways. First, they built a huge, standardized dataset called UniACT-dataset by cleaning and organizing data from six existing robot datasets. Then, they proposed the idea that robot actions are not random but follow patterns shaped by physics and the task at hand, which they call the 'Action Manifold Hypothesis'. They used this idea to develop a new learning method, Action Manifold Learning (AML), which helps the robot predict actions more efficiently and reliably. Finally, they added a way for the robot to better understand its surroundings by combining visual information with 3D data.
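The core intuition behind the Action Manifold Hypothesis can be sketched with a toy example (this is an illustration only, not the paper's model): feasible actions occupy a low-dimensional subspace of the full action space, and producing a valid action amounts to projecting an arbitrary vector onto that subspace in a single step. The 1-D basis below is a hypothetical stand-in for a learned manifold.

```python
# Toy illustration of the Action Manifold Hypothesis (not the paper's method):
# valid actions lie on a low-dimensional subspace of the full action space,
# so a raw, off-manifold prediction can be repaired by orthogonal projection.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_onto_manifold(action, basis):
    """Orthogonally project `action` onto the span of an orthonormal basis."""
    coords = [dot(action, b) for b in basis]          # coordinates on the manifold
    dim = len(action)
    return [sum(c * b[i] for c, b in zip(coords, basis)) for i in range(dim)]

# 3-D action space; the feasible manifold here is a 1-D line along the x-axis.
basis = [[1.0, 0.0, 0.0]]
noisy = [0.9, 0.2, -0.1]                              # off-manifold raw output
clean = project_onto_manifold(noisy, basis)
print(clean)  # [0.9, 0.0, 0.0] -- one projection step, no iterative refinement
```

In the paper the manifold is not a fixed linear subspace: AML trains a DiT backbone so that this projection is learned implicitly, but the single-step character is the same.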
Why it matters?
This work is important because it represents a step towards creating truly versatile robots. By making it easier to train robots to work with different bodies and in different environments, we can move closer to having robots that can help us with a wider range of tasks, from manufacturing and logistics to healthcare and exploration. The release of their code and data will also allow other researchers to build upon their work and accelerate progress in the field.
Abstract
Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show that the components operate independently and yield additive benefits. We will release all code and pipelines for reproducibility and future research.
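The abstract's claim that shifting from denoising to direct prediction improves decoding speed can be made concrete with a hedged sketch of the call counts involved (the `forward` network below is a placeholder, not the paper's DiT): a diffusion-style policy pays one network call per denoising step, while a direct predictor pays one call total.

```python
# Hedged sketch of decoding cost (illustration only): a diffusion-style policy
# refines its action estimate over K denoising steps, one network call each,
# while a direct predictor (as AML aims for) emits the action in a single call.

def forward(x):
    # Placeholder network: pull each action dimension halfway toward a target.
    target = 0.5
    return [(a + target) / 2 for a in x]

def diffusion_decode(x, steps=10):
    calls = 0
    for _ in range(steps):            # iterative denoising loop
        x = forward(x)
        calls += 1
    return x, calls

def direct_decode(x):
    return forward(x), 1              # single projection-style forward pass

_, k_diffusion = diffusion_decode([0.0], steps=10)
_, k_direct = direct_decode([0.0])
print(k_diffusion, k_direct)  # 10 1
```

The speedup in practice depends on the number of denoising steps the baseline uses; the sketch only shows where the constant factor comes from.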