Bi-Level Motion Imitation for Humanoid Robots
Wenshuai Zhao, Yi Zhao, Joni Pajarinen, Michael Muehlebach
2024-10-29

Summary
This paper presents a bi-level imitation learning framework that helps humanoid robots learn from human motion capture (MoCap) data by adjusting reference motions that the robot cannot physically reproduce.
What's the problem?
Humanoid robots differ from humans in morphology, for example in their degrees of joint freedom and force limits, so exactly replicating recorded human motions is often physically infeasible. When such infeasible MoCap motions are kept as training targets, they degrade the performance of the learned robot policy.
What's the solution?
The authors propose a bi-level optimization approach that alternates between two steps: improving the robot policy and modifying the target MoCap motions it imitates. Beforehand, they train a generative latent dynamics model built on a novel self-consistent auto-encoder, which learns sparse and structured representations of the motion data. This model generates reference motions, and its latent representation regularizes how the references are modified, so the adjusted motions stay close to the original motion patterns while becoming physically consistent with the robot's capabilities. A simplified sketch of the alternating update is shown below.
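To make the alternation concrete, here is a minimal, heavily simplified sketch in PyTorch. The networks, losses, and hyperparameters are placeholder assumptions for illustration only, not the paper's actual models: the lower level updates the policy against the current reference motion, and the upper level adjusts the reference motion itself while an auto-encoder reconstruction term keeps it near the learned latent motion manifold.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, latent_dim, horizon = 32, 12, 8, 50

# Placeholder policy and auto-encoder (in the paper the auto-encoder is trained
# beforehand on the MoCap dataset; here it is only randomly initialized).
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
encoder = nn.Linear(obs_dim, latent_dim)
decoder = nn.Linear(latent_dim, obs_dim)

policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# The reference MoCap trajectory is itself an optimization variable (upper level).
ref_motion = torch.randn(horizon, obs_dim, requires_grad=True)
ref_opt = torch.optim.Adam([ref_motion], lr=1e-3)

def tracking_loss(policy, ref):
    """Toy stand-in for the imitation objective (no physics simulation here)."""
    actions = policy(ref)
    smoothness = (ref[1:] - ref[:-1]).pow(2).mean()
    return actions.pow(2).mean() + smoothness

for outer_step in range(100):
    # Lower level: improve the robot policy against the current reference motion.
    for _ in range(10):
        policy_opt.zero_grad()
        tracking_loss(policy, ref_motion.detach()).backward()
        policy_opt.step()

    # Upper level: adjust the reference motion toward something the policy can
    # track, regularized to stay on the learned latent motion manifold.
    ref_opt.zero_grad()
    recon = decoder(encoder(ref_motion))
    upper_loss = tracking_loss(policy, ref_motion) + 0.1 * (recon - ref_motion).pow(2).mean()
    upper_loss.backward()
    ref_opt.step()
```

In this toy version the "physical consistency" of the reference is only stood in for by a smoothness term; in the paper it comes from simulating the humanoid, but the alternating structure is the point being illustrated.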
Why it matters?
This research matters because human MoCap data is a rich but imperfect source of training signal for humanoid robots: some recorded motions simply cannot be executed by a given robot. By modifying physically infeasible reference motions rather than forcing the robot to imitate them exactly, the proposed framework makes better use of existing MoCap datasets and yields stronger robot policies, as demonstrated in simulations with a realistic humanoid model.
Abstract
Imitation learning from human motion capture (MoCap) data provides a promising way to train humanoid robots. However, due to differences in morphology, such as varying degrees of joint freedom and force limits, exact replication of human behaviors may not be feasible for humanoid robots. Consequently, incorporating physically infeasible MoCap data in training datasets can adversely affect the performance of the robot policy. To address this issue, we propose a bi-level optimization-based imitation learning framework that alternates between optimizing both the robot policy and the target MoCap data. Specifically, we first develop a generative latent dynamics model using a novel self-consistent auto-encoder, which learns sparse and structured motion representations while capturing desired motion patterns in the dataset. The dynamics model is then utilized to generate reference motions while the latent representation regularizes the bi-level motion imitation process. Simulations conducted with a realistic model of a humanoid robot demonstrate that our method enhances the robot policy by modifying reference motions to be physically consistent.
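For readers who want a concrete picture of the auto-encoder idea mentioned in the abstract, the sketch below shows one plausible way to combine a reconstruction loss with a self-consistency term and a sparsity penalty on the latent code. The module sizes, loss weights, and the exact form of the self-consistency term are assumptions for illustration; the paper's actual architecture and objective may differ.

```python
import torch
import torch.nn as nn

motion_dim, latent_dim = 32, 8
enc = nn.Sequential(nn.Linear(motion_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, motion_dim))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

batch = torch.randn(256, motion_dim)          # stand-in for MoCap motion frames

for step in range(1000):
    z = enc(batch)
    recon = dec(z)
    # Self-consistency: re-encoding the decoded motion should give back the same latent.
    z_cycle = enc(recon)
    loss = (
        (recon - batch).pow(2).mean()         # reconstruction
        + 0.1 * (z_cycle - z).pow(2).mean()   # self-consistency
        + 0.01 * z.abs().mean()               # sparsity of the latent code
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```

A sparse, structured latent space of this kind is what allows modified reference motions to be regularized toward the motion patterns seen in the dataset during the bi-level imitation process.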