VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka

2025-10-14

Summary

This paper introduces a new method called VER, which stands for Vision Expert transformer for Robot learning, to help robots learn visual tasks more effectively by combining the strengths of different pre-trained vision models.

What's the problem?

Robots often struggle to generalize what they learn from one task to another because the pre-trained vision models they rely on each excel only in specific domains. Combining multiple vision models can help, but naively fusing them is inflexible: the feature selection ends up fixed per task, and teaching the system robot-specific knowledge about its environment requires costly full re-training.

What's the solution?

VER solves this by first distilling multiple pre-trained vision models into a 'library' of vision experts. Then, instead of retraining everything, VER fine-tunes only a lightweight routing network, fewer than 0.4% of the system's parameters, which decides which experts are most useful for a given task. The authors also introduce a dynamic selection scheme called Patchwise Expert Routing with Curriculum Top-K Annealing: experts are chosen separately for each image patch, so the model can focus on the important parts of an image, and the number of active experts is gradually reduced as training progresses, sharpening the selection over time. This allows the robot to adapt efficiently and incorporate new knowledge about its surroundings.
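To make the routing idea concrete, here is a minimal NumPy sketch of patchwise expert routing with top-K gating. It is an illustration under assumptions, not the paper's implementation: the function and variable names (`patchwise_route`, `router_w`, `expert_feats`) are invented, and the router is simplified to a single linear layer that scores each frozen expert per patch and mixes the top-K experts' features with a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchwise_route(patch_tokens, router_w, expert_feats, k=2):
    """For each image patch, score all experts with a tiny linear router,
    keep the top-k, and return their softmax-weighted feature mixture.

    patch_tokens: (P, D) patch embeddings
    router_w:     (D, E) router weights -- the only trainable part here
    expert_feats: (E, P, D) features from each frozen vision expert
    """
    logits = patch_tokens @ router_w               # (P, E) per-patch expert scores
    topk = np.argsort(logits, axis=1)[:, -k:]      # indices of the k best experts per patch
    mixed = np.zeros_like(patch_tokens)
    for p in range(patch_tokens.shape[0]):
        sel = topk[p]
        w = np.exp(logits[p, sel] - logits[p, sel].max())
        w /= w.sum()                               # softmax over the selected experts only
        mixed[p] = (w[:, None] * expert_feats[sel, p]).sum(axis=0)
    return mixed, topk

# Toy example: 4 patches, 16-dim features, 3 experts.
patches = rng.standard_normal((4, 16))
router = rng.standard_normal((16, 3)) * 0.01
experts = rng.standard_normal((3, 4, 16))
out, chosen = patchwise_route(patches, router, experts, k=2)
print(out.shape, chosen.shape)   # (4, 16) (4, 2)
```

Because only `router_w` would be trained, the frozen expert library can be reused across tasks, which is what keeps adaptation cheap.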

Why it matters?

This research is important because it allows robots to learn more quickly and perform better on a wider range of tasks. By efficiently combining existing vision models and making it easier to adapt to new situations, VER brings us closer to robots that can reliably operate in complex, real-world environments.

Abstract

Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and code can be found at https://yixiaowang7.github.io/ver_page/.
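The abstract does not specify the annealing schedule, but the "Curriculum Top-K Annealing" idea (start with many active experts per patch, then narrow the selection as training progresses) can be sketched with a hypothetical linear schedule. The helper name `annealed_k` and the endpoints `k_start`/`k_end` are assumptions for illustration only.

```python
def annealed_k(step, total_steps, k_start=4, k_end=1):
    """Linearly anneal the number of active experts per patch from k_start
    down to k_end over training (linear shape is an assumption; the paper
    does not publish its exact schedule in the abstract)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return max(k_end, round(k_start - frac * (k_start - k_end)))

# Early training explores many experts; late training commits to few.
schedule = [annealed_k(s, 100) for s in (0, 25, 50, 75, 100)]
print(schedule)   # [4, 3, 2, 2, 1]
```

Starting broad lets the router gather gradient signal about every expert before the curriculum forces it to commit, which is the usual motivation for this kind of annealing.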