Learning from Massive Human Videos for Universal Humanoid Pose Control

Jiageng Mao, Siheng Zhao, Siqi Song, Tianheng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Jitendra Malik, Vitor Guizilini, Yue Wang

2024-12-19

Summary

This paper introduces Humanoid-X, a large-scale dataset designed to help humanoid robots learn how to move and perform tasks by learning from videos of human actions. Using this extensive data, the researchers aim to improve how well robots can understand and execute commands given as text descriptions.

What's the problem?

Traditional methods for teaching humanoid robots whole-body control rely on reinforcement learning or teleoperation, which are limited by the narrow diversity of simulated training environments and the high cost of collecting human demonstrations. This makes it hard for robots to learn effectively and generalize to real-world situations.

What's the solution?

The authors created the Humanoid-X dataset, which contains over 20 million humanoid robot poses paired with text descriptions of the motions. They built a pipeline that mines human videos from the Internet, generates captions for them, retargets the human motions onto humanoid robots, and trains a control policy for real-world deployment (a simplified sketch of this pipeline appears below). On this data they train a large humanoid model, UH-1, which takes text instructions as input and outputs the corresponding robot actions.
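To make the pipeline concrete, here is a minimal Python sketch of the curation stages described above. All function and class names (`mine_videos`, `generate_caption`, `estimate_and_retarget`, `MotionClip`) are hypothetical placeholders stubbed with dummy data, not the paper's actual code; the real system uses vision, captioning, and pose-estimation models at each stage.

```python
# Illustrative sketch of the Humanoid-X curation pipeline.
# Every name here is a hypothetical placeholder, stubbed so the
# script runs end to end with dummy data.

from dataclasses import dataclass
from typing import List

@dataclass
class MotionClip:
    caption: str             # text description of the motion
    robot_poses: List[list]  # per-frame poses retargeted onto the robot

def mine_videos(source: str) -> List[str]:
    # 1. Data mining: collect human-motion videos from the Internet.
    return ["video_001.mp4"]

def generate_caption(video: str) -> str:
    # 2. Video captioning: describe the motion in natural language.
    return "a person waves their right hand"

def estimate_and_retarget(video: str) -> List[list]:
    # 3. Estimate human poses, then retarget them onto the
    #    humanoid's joint configuration (stubbed as zero vectors).
    human_poses = [[0.0] * 24]
    return human_poses

def curate(source: str) -> List[MotionClip]:
    # Assemble (caption, motion) pairs for the dataset.
    return [
        MotionClip(generate_caption(v), estimate_and_retarget(v))
        for v in mine_videos(source)
    ]

if __name__ == "__main__":
    dataset = curate("internet")
    for clip in dataset:
        # 4. Policy learning would train a text-to-action model
        #    (like UH-1) on pairs such as this one.
        print(clip.caption, "->", len(clip.robot_poses), "frames")
```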

Why it matters?

This research is significant because it opens up new ways for humanoid robots to learn from real human behavior, making them more adaptable to real-world tasks. By improving how robots interpret and act on instructions, this work could lead to better performance in areas like service robotics, healthcare, and entertainment.

Abstract

Scalable learning of humanoid robots is crucial for their deployment in real-world applications. While traditional approaches primarily rely on reinforcement learning or teleoperation to achieve whole-body control, they are often limited by the diversity of simulated environments and the high costs of demonstration collection. In contrast, human videos are ubiquitous and present an untapped source of semantic and motion information that could significantly enhance the generalization capabilities of humanoid robots. This paper introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot poses with corresponding text-based motion descriptions, designed to leverage this abundant data. Humanoid-X is curated through a comprehensive pipeline: data mining from the Internet, video caption generation, motion retargeting of humans to humanoid robots, and policy learning for real-world deployment. With Humanoid-X, we further train a large humanoid model, UH-1, which takes text instructions as input and outputs corresponding actions to control a humanoid robot. Extensive simulated and real-world experiments validate that our scalable training approach leads to superior generalization in text-based humanoid control, marking a significant step toward adaptable, real-world-ready humanoid robots.