UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, Yixiao Ge
2026-04-24
Summary
This paper addresses the challenge of teaching robots, specifically humanoid robots, to perform tasks by learning from human demonstrations. It introduces a new framework called UniT that helps bridge the gap between how humans and robots move and interact with the world.
What's the problem?
Training humanoid robots is difficult because there isn't a lot of data showing them how to do things. While we have tons of videos of humans performing actions, robots and humans move very differently – their bodies are built differently. This difference, called a 'kinematic mismatch', makes it hard for robots to directly learn from human examples. Simply showing a robot a video of a person won't automatically translate into the robot knowing how to do the same thing.
What's the solution?
UniT tackles this problem by creating a common 'language' for describing actions, regardless of who performs them. One part of the system predicts the visual consequences of an action (anchoring movements to their physical outcomes), while another reconstructs the action from what is seen (filtering out visual details that have nothing to do with the action itself). A third, fusion branch then combines these two purified signals into a shared set of discrete tokens that capture the core physical intent of the action in a way both humans and robots can understand. This allows the robot to learn from human data and then perform the action itself, or even generate realistic videos of a robot doing the action.
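To make the tri-branch idea concrete, here is a minimal toy sketch of such a tokenizer. All names, dimensions, and the use of simple linear maps are assumptions for illustration; the paper's actual architecture is not specified here. The key structure is the three branches: action-to-vision prediction, vision-to-action reconstruction, and a fusion branch quantized against a shared codebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (assumed, not from the paper)
VIS, ACT, LAT, CODES = 64, 16, 32, 128
W_a2v = rng.standard_normal((ACT, VIS)) * 0.1   # branch 1: action -> predicted visual consequence
W_v2a = rng.standard_normal((VIS, ACT)) * 0.1   # branch 2: vision -> reconstructed action
W_fuse = rng.standard_normal((VIS + ACT, LAT)) * 0.1  # branch 3: fusion
codebook = rng.standard_normal((CODES, LAT))    # shared discrete latent space

def tokenize(vis_feat, action):
    pred_vis = action @ W_a2v        # anchor kinematics to physical outcomes
    pred_act = vis_feat @ W_v2a      # filter out action-irrelevant visual detail
    z = np.concatenate([vis_feat, action], axis=-1) @ W_fuse
    # nearest-neighbor lookup in the shared codebook -> embodiment-agnostic tokens
    dists = ((z[:, None, :] - codebook[None]) ** 2).sum(-1)
    tokens = dists.argmin(axis=-1)
    # cross-reconstruction losses would be minimized during training
    recon_loss = ((pred_vis - vis_feat) ** 2).mean() + ((pred_act - action) ** 2).mean()
    return tokens, recon_loss

vis = rng.standard_normal((4, VIS))  # frame features from a human OR humanoid clip
act = rng.standard_normal((4, ACT))  # embodiment-specific action vector
tokens, loss = tokenize(vis, act)
print(tokens.shape)  # (4,)
```

Because both embodiments are forced through the same codebook, a human demonstration and a humanoid execution of the same task should land on similar token sequences.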
Why it matters?
This research is important because it offers a way to efficiently teach robots complex skills using the vast amount of human activity data available. Instead of needing a lot of expensive and time-consuming robot-specific training data, we can leverage what humans already know. This could lead to more capable and adaptable robots that can assist us in a wider range of tasks, and it’s a step towards robots that can learn and generalize like humans do.
Abstract
Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both humanoid simulation benchmarks and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
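The policy-learning paradigm described above (VLA-UniT "predicting these unified tokens") can be sketched as ordinary token prediction: because human and humanoid data map to the same token vocabulary, both can supervise the same head. Everything below is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
OBS, CODES = 48, 128                      # assumed observation and vocabulary sizes

W_policy = rng.standard_normal((OBS, CODES)) * 0.1

def policy_logits(obs):
    # VLA-style head: score each unified token given the observation
    return obs @ W_policy

def cross_entropy(logits, target_tokens):
    # standard token-prediction loss; human and humanoid data supply the
    # same kind of targets, so both sources train the same policy head
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_tokens)), target_tokens].mean()

obs = rng.standard_normal((4, OBS))       # observations from either embodiment
targets = rng.integers(0, CODES, size=4)  # unified tokens from the tokenizer
loss = cross_entropy(policy_logits(obs), targets)
assert loss > 0
```

The same tokens could likewise serve as conditioning inputs for a video world model (WM-UniT), so that action controllability learned from human clips carries over to humanoid generation.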