Theia: Distilling Diverse Vision Foundation Models for Robot Learning
Jinghuan Shang, Karl Schmeckpeper, Brandon B. May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, Laura Herlant
2024-07-30

Summary
This paper introduces Theia, a vision model for robot learning that combines knowledge from multiple existing vision models into a single representation, helping robots interpret visual information and perform a variety of tasks more effectively.
What's the problem?
Robots need to understand complex visual information to perform tasks, but most existing vision models are trained for a single purpose, such as identifying objects or segmenting images. These specialized models capture only part of the visual knowledge robots need, making it hard for them to adapt to different environments and situations.
What's the solution?
Theia addresses this problem by distilling knowledge from several off-the-shelf vision foundation models into a single model. It combines the strengths of these models, each trained on a different visual task, to produce rich visual representations that support downstream robot learning. In extensive experiments, Theia has been shown to outperform previous models while using less training data and smaller model sizes. It also uses feature translation, mapping its own representation into each teacher's feature space during training, which improves the quality of the representations used for robot learning.
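To make the distillation idea concrete, here is a minimal PyTorch sketch of multi-teacher feature distillation with per-teacher translator heads. The encoder, head design, dimensions, and loss weighting below are illustrative assumptions, not the authors' exact architecture or training recipe; see the linked repository for the real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistiller(nn.Module):
    """Sketch: a shared student encoder plus one "translator" head per teacher.

    The student produces a single latent representation per image; each
    translator maps that latent into the corresponding teacher's feature
    space so it can be regressed against the frozen teacher's output.
    """

    def __init__(self, student_dim: int, teacher_dims: dict[str, int]):
        super().__init__()
        # Placeholder patchify-style backbone (Theia uses a ViT encoder).
        self.student = nn.Sequential(
            nn.Conv2d(3, student_dim, kernel_size=16, stride=16),
            nn.Flatten(2),                                  # (B, D, N_patches)
        )
        # One lightweight translator head per teacher model.
        self.translators = nn.ModuleDict({
            name: nn.Linear(student_dim, dim) for name, dim in teacher_dims.items()
        })

    def forward(self, images: torch.Tensor) -> dict[str, torch.Tensor]:
        latent = self.student(images).transpose(1, 2)       # (B, N, D)
        return {name: head(latent) for name, head in self.translators.items()}


def distillation_loss(predicted: dict[str, torch.Tensor],
                      teacher_feats: dict[str, torch.Tensor]) -> torch.Tensor:
    """Regress translated student features onto each frozen teacher's features."""
    losses = []
    for name, pred in predicted.items():
        target = teacher_feats[name].detach()                # teachers stay frozen
        cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
        l1 = F.smooth_l1_loss(pred, target)
        losses.append(cos + l1)                              # illustrative combination
    return torch.stack(losses).mean()
```

At robot-learning time, only the student encoder is kept; the translator heads and teacher models are discarded, which is why the distilled model can stay small.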
Why it matters?
This research is important because it enhances how robots can learn from their surroundings, making them more versatile and effective in performing tasks. By improving robot learning capabilities, Theia can contribute to advancements in robotics, such as better automation in industries, improved assistive technologies, and smarter robotic systems that can adapt to new challenges.
Abstract
Vision-based robot policy learning, which maps visual inputs to actions, necessitates a holistic understanding of diverse visual tasks beyond single-task needs like classification or segmentation. Inspired by this, we introduce Theia, a vision foundation model for robot learning that distills multiple off-the-shelf vision foundation models trained on varied vision tasks. Theia's rich visual representations encode diverse visual knowledge, enhancing downstream robot learning. Extensive experiments demonstrate that Theia outperforms its teacher models and prior robot learning models using less training data and smaller model sizes. Additionally, we quantify the quality of pre-trained visual representations and hypothesize that higher entropy in feature norm distributions leads to improved robot learning performance. Code and models are available at https://github.com/bdaiinstitute/theia.
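The abstract's hypothesis about feature-norm entropy can be illustrated with a short sketch: take the L2 norm of each token embedding, build an empirical distribution over those norms, and compute its Shannon entropy. The histogram-based estimator and bin count below are assumptions for illustration, not the paper's exact measurement protocol.

```python
import torch

def feature_norm_entropy(features: torch.Tensor, num_bins: int = 64) -> float:
    """Entropy (in nats) of the distribution of per-token feature norms.

    `features` is (N, D): N token/patch embeddings of dimension D.
    Higher entropy here is hypothesized to correlate with better
    downstream robot learning performance.
    """
    norms = features.norm(dim=-1)                            # (N,) L2 norms
    hist = torch.histc(norms, bins=num_bins,
                       min=norms.min().item(), max=norms.max().item())
    probs = hist / hist.sum()                                # empirical distribution
    probs = probs[probs > 0]                                 # avoid log(0)
    return float(-(probs * probs.log()).sum())               # Shannon entropy
```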