Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization
Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, Yicheng Feng, Zongqing Lu
2026-01-21
Summary
This paper introduces Being-H0.5, a new artificial intelligence model that lets robots understand and carry out tasks from visual input and language commands and, importantly, works across many different types of robots.
What's the problem?
Existing models that turn language and visual input into robot actions often struggle when the robot's physical design changes or when there isn't much training data for that specific robot. A skill learned on one robot body is hard to carry over to a different one, and so far there hasn't been a good way to learn from human demonstrations that transfers easily across robotic forms.
What's the solution?
The researchers tackled this by treating human interaction with the physical world as the common foundation. They assembled a huge pre-training dataset, UniHand-2.0, with over 35,000 hours of multimodal data spanning human hand-object interactions and 30 different robotic embodiments. On top of it, they built a Unified Action Space that translates each robot's controls into a shared 'language' of actions, so robots with little data can borrow skills from human demonstrations and from better-resourced platforms. The model itself, Being-H0.5, uses a Mixture-of-Transformers architecture to separate general movement skills from skills specific to each robot. They also added techniques that keep the robot's actions stable and adaptable in real-world conditions, even if its sensors are slightly off or it responds at a different speed.
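To make the idea of a shared action 'language' more concrete, here is a minimal sketch of how a unified action space could work in principle: each robot registers a small adapter that maps its native control vector into a fixed set of semantically named slots, so a gripper arm and a dexterous hand end up writing into the same layout. The slot names, dimensions, and the UnifiedActionSpace-style interface below are illustrative assumptions, not the paper's actual design.

```python
# Illustrative sketch of a unified action space (slot names, dimensions, and the
# adapter interface are assumptions for illustration, not the paper's interface).
import numpy as np

# Shared, semantically aligned slots that every embodiment maps into.
UNIFIED_SLOTS = ["ee_translation", "ee_rotation", "gripper", "hand_joints"]
SLOT_DIMS = {"ee_translation": 3, "ee_rotation": 3, "gripper": 1, "hand_joints": 20}
UNIFIED_DIM = sum(SLOT_DIMS.values())  # 27 in this toy layout

# Precompute where each slot starts inside the unified vector.
_starts = np.cumsum([0] + [SLOT_DIMS[s] for s in UNIFIED_SLOTS])
SLOT_START = dict(zip(UNIFIED_SLOTS, _starts[:-1]))


class EmbodimentAdapter:
    """Maps a robot's native action vector into the shared slot layout."""

    def __init__(self, native_dim: int, active_slots: list[str], seed: int = 0):
        rng = np.random.default_rng(seed)
        self.active_slots = active_slots
        out_dim = sum(SLOT_DIMS[s] for s in active_slots)
        # A learned projection in a real system; random weights stand in here.
        self.W = rng.normal(scale=0.1, size=(out_dim, native_dim))

    def to_unified(self, native_action: np.ndarray) -> np.ndarray:
        """Project the native action and scatter it into the unified vector."""
        projected = self.W @ native_action
        unified = np.zeros(UNIFIED_DIM)
        offset = 0
        for slot in self.active_slots:
            d = SLOT_DIMS[slot]
            unified[SLOT_START[slot]: SLOT_START[slot] + d] = projected[offset: offset + d]
            offset += d
        return unified


# A 7-DoF arm with a parallel gripper only fills the arm/gripper slots...
arm = EmbodimentAdapter(native_dim=8, active_slots=["ee_translation", "ee_rotation", "gripper"])
# ...while a dexterous hand (or a human hand track) also fills the finger-joint slot.
hand = EmbodimentAdapter(native_dim=26, active_slots=["ee_translation", "ee_rotation", "hand_joints"])

print(arm.to_unified(np.ones(8)).shape, hand.to_unified(np.ones(26)).shape)  # (27,) (27,)
```

Because both embodiments write into the same 27-dimensional vector, a single policy can be trained on their data jointly, which is the intuition behind letting low-resource robots bootstrap from human and high-resource data.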
Why it matters?
This work is important because it's a big step towards creating robots that are more versatile and easier to use. Instead of needing to retrain a robot from scratch for every new task or body, we can potentially leverage human knowledge and a single model to control a wide range of robotic platforms. This could lead to robots that are more helpful in everyday life, from assisting in homes to working in factories.
Abstract
We introduce Being-H0.5, a foundational Vision-Language-Action (VLA) model designed for robust cross-embodiment generalization across diverse robotic platforms. While existing VLAs often struggle with morphological heterogeneity and data scarcity, we propose a human-centric learning paradigm that treats human interaction traces as a universal "mother tongue" for physical interaction. To support this, we present UniHand-2.0, the largest embodied pre-training recipe to date, comprising over 35,000 hours of multimodal data across 30 distinct robotic embodiments. Our approach introduces a Unified Action Space that maps heterogeneous robot controls into semantically aligned slots, enabling low-resource robots to bootstrap skills from human data and high-resource platforms. Built upon this human-centric foundation, we design a unified sequential modeling and multi-task pre-training paradigm to bridge human demonstrations and robotic execution. Architecturally, Being-H0.5 utilizes a Mixture-of-Transformers design featuring a novel Mixture-of-Flow (MoF) framework to decouple shared motor primitives from specialized embodiment-specific experts. Finally, to make cross-embodiment policies stable in the real world, we introduce Manifold-Preserving Gating for robustness under sensory shift and Universal Async Chunking to universalize chunked control across embodiments with different latency and control profiles. We empirically demonstrate that Being-H0.5 achieves state-of-the-art results on simulated benchmarks, such as LIBERO (98.9%) and RoboCasa (53.9%), while also exhibiting strong cross-embodiment capabilities on five robotic platforms.
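As a rough illustration of how a Mixture-of-Transformers-style policy might decouple shared motor primitives from embodiment-specific experts, the PyTorch sketch below routes each token representation through one shared expert plus one expert selected by embodiment ID, blending them with a learned gate. The module names, layer sizes, and the simple sigmoid gate are assumptions made for illustration; the paper's actual Mixture-of-Flow framework is more involved.

```python
# Minimal sketch: shared expert + per-embodiment expert, mixed by a learned gate
# (illustrative only; not the paper's Mixture-of-Flow implementation).
import torch
import torch.nn as nn


class SharedPlusEmbodimentExperts(nn.Module):
    def __init__(self, d_model: int = 256, n_embodiments: int = 30):
        super().__init__()
        # Shared expert: intended to capture motor primitives common to all bodies.
        self.shared = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # One lightweight expert per embodiment for body-specific control details.
        self.experts = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_embodiments)
        )
        # Per-token scalar gate deciding how much embodiment-specific signal to mix in.
        self.gate = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor, embodiment_id: int) -> torch.Tensor:
        shared_out = self.shared(h)
        expert_out = self.experts[embodiment_id](h)
        g = torch.sigmoid(self.gate(h))            # shape: (batch, seq, 1)
        return (1 - g) * shared_out + g * expert_out


# Toy usage: a batch of 2 sequences of 16 tokens for embodiment #7.
model = SharedPlusEmbodimentExperts()
tokens = torch.randn(2, 16, 256)
out = model(tokens, embodiment_id=7)
print(out.shape)  # torch.Size([2, 16, 256])
```

The design intuition is that gradients from every embodiment update the shared expert, while only that embodiment's own data updates its specialized expert, which is one way a model could keep general motor knowledge separate from body-specific control.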