A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning
Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi
2025-03-11
Summary
This paper studies how to improve robot learning with pre-trained vision models that learn object-centric representations, even when trained on messy real-world data that is not centered on single objects.
What's the problem?
Vision models currently used for robot learning work well when pre-trained on clean, single-object-centric data, but they degrade on cluttered scenes and mixed data sources, making robots less adaptable in real-world tasks such as grabbing an item from a crowded table.
What's the solution?
SlotMIM pushes models toward object-focused features in two ways: it restricts the number of prototypes the model can use to describe an image (a semantic bottleneck that encourages patches of the same object to cluster together), and it requires the model to produce consistent representations across different views of the same scene. Together these help robots identify objects in complex, cluttered environments.
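To make the "semantic bottleneck" idea concrete, here is a minimal, illustrative sketch (not the paper's actual implementation): patch features are soft-assigned to a deliberately small set of prototypes via a softmax over cosine similarities. All names, shapes, and the temperature value are assumptions for illustration; with few prototypes, each one must cover a coherent, object-like group of patches.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def prototype_assignments(patch_features, prototypes, temperature=0.1):
    """Soft-assign each patch feature to prototypes via a softmax
    over cosine similarities. A small number of prototypes acts as
    the bottleneck that groups patches into object-like clusters."""
    sims = l2_normalize(patch_features) @ l2_normalize(prototypes).T
    logits = sims / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy example: 196 patch features (a 14x14 grid), 64-dim,
# assigned to only K = 8 prototypes.
patches = rng.normal(size=(196, 64))
protos = rng.normal(size=(8, 64))
assign = prototype_assignments(patches, protos)
print(assign.shape)  # (196, 8): one distribution over prototypes per patch
```

Each row of `assign` is a probability distribution over the 8 prototypes; shrinking K is what forces the grouping to operate at the level of objects rather than low-level textures.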
Why it matters?
This makes robots better at tasks such as sorting, cleaning, or assisting in homes and factories by improving object recognition in messy, real-world settings, without requiring perfectly curated training data.
Abstract
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. Through systematic evaluation, we find that while DINO and iBOT outperform MAE across visuomotor control and perception tasks, they struggle when trained on non-(single-)object-centric (NOC) data, a limitation strongly correlated with their diminished ability to learn object-centric representations. This investigation indicates that the ability to form object-centric representations from non-object-centric robotics datasets is key to the success of PVMs. Motivated by this discovery, we designed SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck, which reduces the number of prototypes to encourage the emergence of objectness, together with cross-view consistency regularization, which encourages multi-view invariance. Our experiments encompass pre-training on object-centric, scene-centric, web-crawled, and ego-centric data. Across all settings, our approach learns transferable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations. When scaled up with million-scale datasets, our method also demonstrates superior data efficiency and scalability. Our code and models are publicly available at https://github.com/CVMI-Lab/SlotMIM.
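The cross-view consistency regularization mentioned in the abstract can be sketched with a simple symmetric cross-entropy between the prototype-assignment distributions computed from two augmented views of the same image. This is a toy illustration under assumed names and toy data, not the paper's exact loss: the key property is that the loss is small only when the two views agree on how patches map to prototypes.

```python
import numpy as np

def cross_view_consistency_loss(p_view1, p_view2, eps=1e-8):
    """Symmetric cross-entropy between prototype-assignment
    distributions from two augmented views of the same image;
    it is minimized when both views produce matching assignments."""
    ce12 = -(p_view1 * np.log(p_view2 + eps)).sum(axis=1).mean()
    ce21 = -(p_view2 * np.log(p_view1 + eps)).sum(axis=1).mean()
    return 0.5 * (ce12 + ce21)

# Toy assignment distributions over 3 prototypes for 2 patches.
agree = np.array([[0.90, 0.05, 0.05],
                  [0.05, 0.90, 0.05]])
disagree = np.array([[0.05, 0.05, 0.90],
                     [0.90, 0.05, 0.05]])

loss_consistent = cross_view_consistency_loss(agree, agree)
loss_inconsistent = cross_view_consistency_loss(agree, disagree)
print(loss_consistent < loss_inconsistent)  # True
```

Minimizing such a loss encourages multi-view invariance: a patch depicting the same object under different crops or augmentations should land on the same prototype.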