
IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI

Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, Jiang Bian

2024-11-05


Summary

This paper introduces IGOR, which stands for Image-GOal Representations. IGOR aims to give humans and robots a single, shared way of representing actions based on visual information, so that knowledge learned from human activity data can be transferred to robot tasks and vice versa.

What's the problem?

In robotics and AI, it is hard to teach models an action representation that stays consistent across different embodiments, such as humans and robots. Traditional methods struggle to connect the visual changes they observe in images with the actions that caused them, which makes it difficult for robots to learn from human behavior or vice versa.

What's the solution?

To solve this problem, IGOR compresses the visual difference between an initial image and its goal state into a compact latent action. Because these latent actions describe how objects in the scene move rather than how a particular body moves, they form a shared action space that both humans and robots can use. IGOR uses this space to automatically label internet-scale video data with latent actions, and trains a foundation policy model and a world model on top of that data; the foundation policy aligns latent actions with natural language instructions, and a low-level policy model turns latent actions into concrete robot control. This lets IGOR learn from videos on the internet and apply the learned actions across many tasks, enabling better interaction between humans and robots.
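To make the core idea concrete, here is a minimal sketch of what a latent action encoder of this kind might look like. The module names, network sizes, and backbone are assumptions for illustration only (the paper does not publish this code); the point is simply that an (initial image, goal image) pair is compressed into a small latent vector that stands in for the "action" that turned one into the other.

```python
# Illustrative sketch only: hypothetical names and dimensions, not the
# authors' implementation. Compresses the visual change between an
# initial frame and a goal frame into a small latent action vector.
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    def __init__(self, feat_dim: int = 512, action_dim: int = 32):
        super().__init__()
        # Simple convolutional backbone standing in for any visual encoder.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )
        # Compress the (initial, goal) feature pair into a latent action.
        self.to_action = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, action_dim),
        )

    def forward(self, initial_img: torch.Tensor, goal_img: torch.Tensor) -> torch.Tensor:
        z0 = self.backbone(initial_img)  # features of the initial frame
        z1 = self.backbone(goal_img)     # features of the goal frame
        # The latent action summarizes what changed between the two frames.
        return self.to_action(torch.cat([z0, z1], dim=-1))

# Example: label a batch of (initial, goal) frame pairs with latent actions.
encoder = LatentActionEncoder()
frames_t = torch.randn(4, 3, 128, 128)         # initial frames
frames_t_plus_k = torch.randn(4, 3, 128, 128)  # goal frames
latent_actions = encoder(frames_t, frames_t_plus_k)
print(latent_actions.shape)  # torch.Size([4, 32])
```

Because the encoder only looks at pairs of images, the same procedure can label human videos and robot videos alike, which is what makes the resulting action space shared across embodiments.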

Why it matters?

This research is important because it opens new possibilities for teaching robots to perform tasks by learning from human actions. By generalizing across different tasks and embodiments, IGOR could make robots more effective assistants in everyday settings and support advances in robotics, automation, and human-robot collaboration.

Abstract

We introduce Image-GOal Representations (IGOR), aiming to learn a unified, semantically consistent action space across humans and various robots. Through this unified latent action space, IGOR enables knowledge transfer among large-scale robot and human activity data. We achieve this by compressing visual changes between an initial image and its goal state into latent actions. IGOR allows us to generate latent action labels for internet-scale video data. This unified latent action space enables the training of foundation policy and world models across a wide variety of tasks performed by both robots and humans. We demonstrate that: (1) IGOR learns a semantically consistent action space for both humans and robots, characterizing various possible motions of objects that represent physical interaction knowledge; (2) IGOR can "migrate" the movements of an object in one video to other videos, even across humans and robots, by jointly using the latent action model and world model; (3) IGOR can learn to align latent actions with natural language through the foundation policy model, and integrate latent actions with a low-level policy model to achieve effective robot control. We believe IGOR opens new possibilities for human-to-robot knowledge transfer and control.
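As an illustration of how the components named in the abstract might fit together at inference time, the sketch below uses stand-in linear layers with hypothetical names and dimensions (none of this is the authors' API): the foundation policy maps a language instruction and the current observation to a latent action, the low-level policy maps that latent action to a robot command, and the world model predicts the effect of the latent action on the observation.

```python
# Hedged sketch of the component wiring described in the abstract.
# All names, shapes, and stand-in networks are illustrative assumptions.
import torch
import torch.nn as nn

ACTION_DIM, FRAME_DIM, TEXT_DIM, CMD_DIM = 32, 512, 768, 7

# Stand-in networks so the example runs end to end.
foundation_policy = nn.Linear(TEXT_DIM + FRAME_DIM, ACTION_DIM)  # language + obs -> latent action
low_level_policy = nn.Linear(FRAME_DIM + ACTION_DIM, CMD_DIM)    # obs + latent action -> robot command
world_model = nn.Linear(FRAME_DIM + ACTION_DIM, FRAME_DIM)       # obs + latent action -> next obs

def control_step(instruction_emb: torch.Tensor, frame_feat: torch.Tensor):
    """One illustrative control step: language -> latent action -> robot command."""
    latent_action = foundation_policy(torch.cat([instruction_emb, frame_feat], dim=-1))
    robot_command = low_level_policy(torch.cat([frame_feat, latent_action], dim=-1))
    # The world model predicts the effect of the latent action, which is what
    # allows a motion to be "migrated" into other videos, as the abstract notes.
    predicted_next_frame = world_model(torch.cat([frame_feat, latent_action], dim=-1))
    return robot_command, predicted_next_frame

cmd, nxt = control_step(torch.randn(1, TEXT_DIM), torch.randn(1, FRAME_DIM))
print(cmd.shape, nxt.shape)  # torch.Size([1, 7]) torch.Size([1, 512])
```

The key design point this sketch highlights is that the latent action sits between high-level language understanding and low-level control, so the same latent action space can be supervised from both human and robot video data.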