HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Dexterous Manipulation
Zhecheng Yuan, Tianming Wei, Langzhe Gu, Pu Hua, Tianhai Liang, Yuanpei Chen, Huazhe Xu
2025-09-01
Summary
This paper introduces HERMES, a new system that teaches robots complex hand skills by learning from how people move their hands. It focuses on getting robots to perform two-handed tasks, like manipulating objects while moving around, in real-world situations.
What's the problem?
Teaching robots to use their hands the way people do is really hard. A multi-fingered robot hand has many joints, so there are lots of different ways it *could* move, which makes it difficult to figure out the right actions. On top of that, what a robot learns in a simulated environment often fails in the real world, because the simulation never matches reality exactly (the so-called sim2real gap). Finally, a mobile robot has to navigate to the right spot and then manipulate objects from wherever it ends up, which is a complex coordination problem.
What's the solution?
The researchers developed HERMES, which uses reinforcement learning to translate human hand movements into physically plausible robot actions. It can take hand-motion data from several different sources and turn it into instructions the robot can actually execute. To deal with the gap between simulation and reality, they devised an end-to-end sim2real transfer method based on depth images, which helps the learned policy generalize to real-world scenes. They also bridged navigation and manipulation by adding a closed-loop Perspective-n-Point (PnP) localization step, which keeps the robot's camera view precisely aligned with a visual goal so the robot ends up in the right place to start manipulating.
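A standard way to make depth-based policies survive the jump from simulation to a real sensor is to randomize the simulated depth images during training. The sketch below is a generic illustration of that idea, not the paper's actual pipeline; the function name and noise parameters are invented for the example:

```python
import numpy as np

def randomize_depth(depth, rng, noise_std=0.01, dropout_p=0.01, max_shift=0.02):
    """Apply common sim2real perturbations to a simulated depth image (meters).

    Illustrative domain-randomization sketch; parameter values are arbitrary.
    """
    d = depth.copy()
    # Per-pixel Gaussian noise, mimicking sensor measurement jitter
    d += rng.normal(0.0, noise_std, size=d.shape)
    # Global depth offset, mimicking a small calibration error
    d += rng.uniform(-max_shift, max_shift)
    # Random dropout: real depth cameras return holes (0) on shiny or thin surfaces
    holes = rng.random(d.shape) < dropout_p
    d[holes] = 0.0
    return np.clip(d, 0.0, None)
```

Training on such perturbed frames discourages the policy from relying on the unrealistically clean depth that simulators render.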
Why it matters?
This work is important because it brings robots closer to being able to help us with everyday tasks that require skillful hand movements. By learning from humans and adapting to real-world conditions, HERMES allows robots to perform complex actions in unpredictable environments, opening up possibilities for robots to assist in homes, workplaces, and other settings.
Abstract
Leveraging human motion data to impart robots with versatile manipulation skills has emerged as a promising paradigm in robotic manipulation. Nevertheless, translating multi-source human hand motions into feasible robot behaviors remains challenging, particularly for robots equipped with multi-fingered dexterous hands characterized by complex, high-dimensional action spaces. Moreover, existing approaches often struggle to produce policies capable of adapting to diverse environmental conditions. In this paper, we introduce HERMES, a human-to-robot learning framework for mobile bimanual dexterous manipulation. First, HERMES formulates a unified reinforcement learning approach capable of seamlessly transforming heterogeneous human hand motions from multiple sources into physically plausible robotic behaviors. Subsequently, to mitigate the sim2real gap, we devise an end-to-end, depth image-based sim2real transfer method for improved generalization to real-world scenarios. Furthermore, to enable autonomous operation in varied and unstructured environments, we augment the navigation foundation model with a closed-loop Perspective-n-Point (PnP) localization mechanism, ensuring precise alignment of visual goals and effectively bridging autonomous navigation and dexterous manipulation. Extensive experimental results demonstrate that HERMES consistently exhibits generalizable behaviors across diverse, in-the-wild scenarios, successfully performing numerous complex mobile bimanual dexterous manipulation tasks. Project Page: https://gemcollector.github.io/HERMES/.
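The closed-loop PnP localization mentioned in the abstract estimates the camera's pose from correspondences between known 3D points and their 2D image projections. Below is a minimal numpy sketch of the generic Perspective-n-Point computation via the Direct Linear Transform, a textbook method shown only to illustrate the idea; it is not necessarily the solver HERMES uses (robust pipelines typically call an off-the-shelf solver such as OpenCV's `solvePnP`):

```python
import numpy as np

def solve_pnp_dlt(obj_pts, img_pts, K):
    """Estimate camera pose (R, t) from n >= 6 3D-2D correspondences via DLT.

    obj_pts: (n, 3) world points; img_pts: (n, 2) pixels; K: (3, 3) intrinsics.
    """
    n = len(obj_pts)
    # Normalize pixels with the intrinsics: y = K^-1 [u, v, 1]^T
    pts_h = np.hstack([img_pts, np.ones((n, 1))])
    norm = (np.linalg.inv(K) @ pts_h.T).T
    Xh = np.hstack([obj_pts, np.ones((n, 1))])  # homogeneous world points
    # Build the 2n x 12 homogeneous system A p = 0 for P = [R | t]
    A = []
    for X, y in zip(Xh, norm):
        u, v = y[0], y[1]
        A.append(np.hstack([X, np.zeros(4), -u * X]))
        A.append(np.hstack([np.zeros(4), X, -v * X]))
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)  # null vector, up to scale and sign
    # Fix the sign so reconstructed points lie in front of the camera
    if np.mean((P @ Xh.T)[2]) < 0:
        P = -P
    # Project the left 3x3 block onto SO(3) and recover the overall scale
    M = P[:, :3]
    U, S, VT = np.linalg.svd(M)
    R = U @ VT
    if np.linalg.det(R) < 0:  # guard against a reflection
        R = -R
    t = P[:, 3] / S.mean()
    return R, t
```

Running such a solver in a closed loop, i.e. re-estimating the pose and re-issuing a motion command until the visual goal is aligned, is what lets a navigation stack hand off to manipulation with enough precision.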