Multimodal Autoregressive Pre-training of Large Vision Encoders

Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby

2024-11-22

Summary

This paper presents AIMV2, a family of large vision encoders pre-trained with a simple multimodal autoregressive method, enabling them to understand both images and text and improving their performance on a wide range of downstream tasks.

What's the problem?

While existing vision models are good at processing images, they often struggle to integrate information from text and images effectively. This limits their ability to perform well in tasks that require understanding both types of data, such as image captioning or visual question answering.

What's the solution?

AIMV2 addresses this issue by pairing a vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens, so the model learns from images and text within a single unified objective. The pre-training process is deliberately simple and scales well with both data and model size. AIMV2 delivers strong results across benchmarks: its largest encoder, AIMV2-3B, reaches 89.5% accuracy on ImageNet-1k with a frozen trunk, and the family also performs well on vision tasks such as localization, grounding, and classification. A simplified sketch of this objective follows.
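To make the objective concrete, here is a minimal, hypothetical sketch of an AIMV2-style forward pass. This is not the authors' implementation: all names, layer counts, dimensions, and masking details below are illustrative assumptions. The idea it shows is the one described above: a vision encoder embeds image patches, and a causal multimodal decoder then regresses the raw pixels of each next patch and predicts each next text token.

```python
# Hypothetical sketch of an AIMV2-style objective (illustrative only; the
# authors' architecture and hyperparameters differ). A vision encoder embeds
# image patches; a causal multimodal decoder then (1) regresses the raw
# pixel values of the next patch and (2) predicts the next text token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AIMV2Sketch(nn.Module):
    def __init__(self, dim=768, patch_dim=3 * 14 * 14, vocab_size=32000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)  # vision trunk
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=6)   # multimodal decoder
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_head = nn.Linear(dim, patch_dim)  # regresses raw patch pixels
        self.text_head = nn.Linear(dim, vocab_size)  # next-token logits

    def forward(self, patch_embeds, patch_targets, text_ids):
        # patch_embeds:  (B, n_img, dim)       linearly embedded image patches
        # patch_targets: (B, n_img, patch_dim) raw pixel values per patch
        # text_ids:      (B, n_txt)            caption token ids
        vis = self.encoder(patch_embeds)
        seq = torch.cat([vis, self.text_embed(text_ids)], dim=1)
        L = seq.size(1)
        causal = torch.triu(
            torch.full((L, L), float("-inf"), device=seq.device), diagonal=1
        )
        h = self.decoder(seq, mask=causal)
        n_img = vis.size(1)
        # Image loss: predict each raw patch from the positions before it.
        img_loss = F.mse_loss(self.patch_head(h[:, : n_img - 1]), patch_targets[:, 1:])
        # Text loss: next-token prediction conditioned on all image patches.
        logits = self.text_head(h[:, n_img - 1 : -1])
        txt_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), text_ids.reshape(-1))
        return img_loss + txt_loss
```

The notable design choice this sketch captures is that a single decoder supervises both modalities, so the objective behaves like a language-model loss while still training a dense vision trunk.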

Why it matters?

This research is important because it enhances how AI systems can understand and generate information from both images and text. By improving the capabilities of vision models, AIMV2 opens up new possibilities for applications in areas like autonomous vehicles, healthcare imaging, and interactive AI systems, making them more effective in real-world scenarios.

Abstract

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
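For readers unfamiliar with the "frozen trunk" evaluation mentioned in the abstract, the idea is that the pre-trained encoder's weights stay fixed and only a small head is trained on top for classification. Below is a simplified sketch under that assumption; the helper names are hypothetical and the paper's actual probing protocol may differ (for example, using an attentive probe rather than a plain linear head).

```python
# Simplified frozen-trunk probe (illustrative; not the paper's exact protocol).
# The pre-trained encoder is frozen; only a small classification head is trained.
import torch
import torch.nn as nn

def frozen_trunk_probe(encoder: nn.Module, feat_dim: int, num_classes: int = 1000):
    for p in encoder.parameters():
        p.requires_grad = False  # keep the trunk fixed
    encoder.eval()
    head = nn.Linear(feat_dim, num_classes)  # the only trainable part
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
    return head, optimizer

def probe_step(encoder, head, optimizer, images, labels):
    # One training step: gradients flow through the head only.
    with torch.no_grad():
        # Assumes the encoder returns per-patch features (B, n, feat_dim);
        # mean-pool them into a single image representation.
        feats = encoder(images).mean(dim=1)
    loss = nn.functional.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```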