X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou
2025-12-12
Summary
This paper focuses on making it easier to train AI for humanoid robots by creating a lot of realistic training videos of robots doing human-like actions.
What's the problem?
Training AI to control humanoid robots is hard because you need tons of data showing the robot performing different tasks. Existing methods for creating this data from human videos don't work well when the camera isn't directly from the robot's point of view, or when parts of the robot are hidden from view. Simply sticking a robot arm onto existing videos isn't enough for complex movements.
What's the solution?
The researchers developed a new technique called X-Humanoid that uses generative AI to realistically transform videos of humans into videos of humanoid robots. They started with a powerful video generation model (Wan 2.2), restructured it for video-to-video editing, and trained it specifically for this human-to-robot conversion. Because that training requires matched pairs of human and robot videos, they built a pipeline in the Unreal Engine game engine that produced over 17 hours of paired synthetic videos. They then ran the trained model on 60 hours of human videos from the Ego-Exo4D collection, creating and releasing a massive dataset of over 3.6 million "robotized" humanoid video frames.
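At a high level, the dataset-generation step is a map over a corpus of human clips: each clip is passed through the trained human-to-humanoid model and the output frames are accumulated into the released dataset. The sketch below illustrates that control flow only; `robotize_video`, the `Frame` type, and `stub_model` are hypothetical stand-ins (the real system is a finetuned diffusion video model, not a per-frame function).

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical frame type: in a real pipeline each frame is an image array.
Frame = str

@dataclass
class PairedClip:
    """One synthetic training pair: a human clip and its humanoid rendering."""
    human: List[Frame]
    humanoid: List[Frame]

def robotize_video(frames: List[Frame],
                   model: Callable[[List[Frame]], List[Frame]]) -> List[Frame]:
    """Apply a (stubbed) human-to-humanoid video-to-video model to one clip."""
    return model(frames)

# Stub standing in for the finetuned model; the actual method edits whole
# videos with a diffusion model, not per-frame string substitution.
stub_model = lambda frames: [f.replace("human", "humanoid") for f in frames]

# Robotize a toy corpus and count output frames, mirroring the paper's
# "apply the trained model to a large video collection" step at toy scale.
corpus = [["human_f0", "human_f1"], ["human_f0", "human_f1", "human_f2"]]
pairs = [PairedClip(clip, robotize_video(clip, stub_model)) for clip in corpus]
total_frames = sum(len(p.humanoid) for p in pairs)
```

The per-clip structure matters because the model operates on whole video clips (to keep motion consistent across frames), so frames are never robotized in isolation.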
Why it matters?
This work is important because it provides a way to generate the large amounts of diverse data needed to train advanced AI for humanoid robots. By creating a more realistic and comprehensive dataset, it helps robots learn to move and interact with the world more effectively, bringing us closer to having truly intelligent and capable humanoid robots.
Abstract
The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has been proven effective for policy training. However, existing solutions mainly "overlay" robot arms onto egocentric videos, which cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.